tjiangHIT / cuteSV

Long read based human genomic structural variation detection with cuteSV
MIT License
250 stars 36 forks source link

Force Calling Error (CuteSV + SURVIVOR) #140

Closed kchilkert closed 8 months ago

kchilkert commented 8 months ago

Hello,

I am interested in merging a trio of individuals variant sets along with merging of associated genotypes. I have found this step-by-step writeup (force call recall option) to achieve this:

  1. Run cuteSV for each sample to generate sample specific SV callsets.
  2. Perform SURVIVOR to merge every single vcf into merged.vcf.
  3. Rerun cuteSV for each sample with -Ivcf merged.vcf (force calling step).
  4. Perform SURVIVOR again to merge every force called single vcf into final_merged.vcf.

Link: https://github.com/tjiangHIT/cuteSV/issues/14

Below is a summary of my workflow:

-For both ONT and PacBio sequencing technologies I use minimap2 to convert the fastq file to a SAM file. -I use samtools to convert the SAM to BAM file, I then sort and index the BAM file. -I run cuteSV as per suggestions for either ONT or PacBio Hifi, see below:

_cuteSV --max_cluster_bias_INS=1000 --diff_ratio_merging_INS=0.9 --max_cluster_bias_DEL=1000 --diff_ratio_merging_DEL=0.5 --genotype sorted_Mother_PB_mapped.bam hg38.fa MotherPBCuteSV.vcf /my/working/directory

-I do this for all 3 samples for PacBio and ONT as per step 1 above. -I then perform SURVIVOR on each sorted output vcf, as per suggestions, see below:

_SURVIVOR merge triofile_list.txt 1000 1 1 1 0 50 survivormergedlongread.vcf

-I then rerun cuteSV for each sample with -Ivcf force calling command, see below:

_cuteSV --max_cluster_bias_INS=1000 --diff_ratio_merging_INS=0.9 --max_cluster_bias_DEL=1000 --diff_ratio_merging_DEL=0.5 --genotype -Ivcf survivormerged_longread.vcf sorted_Mother_PB_mapped.bam hg38.fa MotherPB_CuteSVforced.vcf /my/working/directory

At this step, it begins running well, then I receive an error message:

multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/site-packages/cuteSV/cuteSV_forcecalling.py", line 574, in solve_fc_wrapper return solve_fc(*args) File "/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/site-packages/cuteSV/cuteSV_forcecalling.py", line 579, in solve_fc readsfile.seek(sigs_index["reads"][chrom]) KeyError: 'chr6_KI270797v1_alt' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/k/kchilkert/.conda/envs/nf-env/bin/cuteSV", line 1267, in run(sys.argv[1:]) File "/home/k/kchilkert/.conda/envs/nf-env/bin/cuteSV", line 1263, in run main_ctrl(args, argv) File "/home/k/kchilkert/.conda/envs/nf-env/bin/cuteSV", line 1095, in main_ctrl result = force_calling_chrom(args.Ivcf, temporary_dir, File "/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/site-packages/cuteSV/cuteSV_forcecalling.py", line 570, in force_calling_chrom result.update(x.get()[0]) File "/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/multiprocessing/pool.py", line 771, in get raise self._value KeyError: 'chr6_KI270797v1_alt'

I have checked the merged VCF file from the SURVIVOR step (step 2) and see this SV called (chr6_KI270797v1_alt), along with many others like this alternate found on other chromosomes in the file. In that writeup I mentioned earlier I believe I found a similar error was noted, see below:

Traceback (most recent call last): File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/bin/cuteSV", line 4, in import('pkg_resources').run_script('cuteSV==1.0.10', 'cuteSV') File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/lib/python3.7/site-packages/pkg_resources/init.py", line 651, in run_script self.require(requires)[0].run_script(script_name, ns) File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/lib/python3.7/site-packages/pkg_resources/init.py", line 1448, in run_script exec(code, namespace, namespace) File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/lib/python3.7/site-packages/cuteSV-1.0.10-py3.7.egg/EGG-INFO/scripts/cuteSV", line 801, in run(sys.argv[1:]) File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/lib/python3.7/site-packages/cuteSV-1.0.10-py3.7.egg/EGG-INFO/scripts/cuteSV", line 797, in run main_ctrl(args, argv) File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/lib/python3.7/site-packages/cuteSV-1.0.10-py3.7.egg/EGG-INFO/scripts/cuteSV", line 672, in main_ctrl max_cluster_bias_dict, threshold_gloab_dict, args.gt_round, args.threads) File "/public/agis/huangsanwen_group/chenglin/softwares/miniconda3/lib/python3.7/site-packages/cuteSV-1.0.10-py3.7.egg/cuteSV/cuteSV_forcecalling.py", line 352, in force_calling read_id_list, max_cluster_bias = find_in_list(sv_type, search_id_list, max_cluster_bias_dict[sv_type], pos, sv_end) KeyError: 'NA'

It was mentioned that at the time(4/17/2021), "the current version of cuteSV can just detect DEL, INS, INV, DUP, and BND. When there is a call in the provided Ivcf file, which is with the type not in the detectable type list, cuteSV will crash due to the unrecognized SV type. Now I have updated a new function to skip the calls with undetectable SV type. Please git clone the latest version of cuteSV and rerun your commands to acquire the newest results."

I am running the latest version of cuteSV (2.1.0) and I believe this is the same problem at hand, cuteSV isn't recognizing the SV type but is also not ignoring it either.

Any work around for this?

Any help is appreciated, Kenneth

Meltpinkg commented 8 months ago

Hello, @kchilkert

Thanks for your clear description of the issue. We have fixed it and updated codes in GitHub. You can reinstall cuteSV via git clone https://github.com/tjiangHIT/cuteSV.git && cd cuteSV/ && python setup.py install Hope it will help!

Best, Shuqi

kchilkert commented 8 months ago

Hi @Meltpinkg,

I appreciate the response, I will give it a try and will let you know how it goes. I did want to mention that I found it odd that CuteSV was throwing errors for SV types it had called previously, I have some listed below from the #CHROM column in the output VCF file from CuteSV:

chr6_KI270798v1_alt chr9_KI270717v1_random chrM 1 chrUn_GL000195v1

I assumed if SV types like this were detected by CuteSV that they would be recognized when CuteSV is force calling after performing SURVIVOR.

-Kenneth

Meltpinkg commented 8 months ago

Hi @kchilkert

I should explain that the above mentioned error KeyError: 'chr6_KI270797v1_alt' is caused by the missing of valid reads in this chromosome, which makes the SVs on this chromosome cannot find their corresponding reads. And yesterday we fixed it and now cuteSV would successfully detect the missing valid reads and finish force calling.

For the SV types errors, I'm afraid that I didn't understand it clearly. cuteSV can detect SVs of five types (insertions, deletions, inversions, duplications, and translocations). Also it can force call the above five types of SVs (for translocations, both SVTYPE=TRA and SVTYPE=BND will be recognized). So did cuteSV fail to call these types of SVs, or there are some other problems in different SV types? Maybe a more detailed example will help.

Best, Shuqi

kchilkert commented 8 months ago

Hi @Meltpinkg,

I see, I misunderstood exactly what the error I received fully meant. With that, I retried after running the git clone command to no avail, I also tried uninstalling and reinstalling which made no difference as well. I decided to try a different tactic, I have this file that is full of a lot of "junk" like what the KeyError references to. A colleague and I were able to work around this issue using bcftools, basically, I wanted to filter this "junk" out of the file that is really of no use to me and which is causing CuteSV to crash. In my case, this was accomplished as follows:

bgzip survivormerged_longread.vcf
bcftools sort survivormerged_longread.vcf.gz -Oz -o survivormerged_longread_sorted.vcf.gz
bcftools index survivormerged_longread_sorted.vcf.gz
bcftools view -r chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY survivormerged_longread_sorted.vcf.gz -Oz -o survivormerged_longread_sorted_updated.vcf.gz

Once that was complete, I had a clean SURVIVOR merged VCF file, which included chr1-22, X, Y to feed to CuteSV which ran perfectly. In all honesty, this may have been necessary for me to do for each of the output VCF files from CuteSV before I fed them to SURVIVOR, but in any case this seemed to work well for my purposes.

I do not know if this is a suitable workaround for future users, especially if a user is interested in what I considered "junk" in my file, but I wanted to let you know how the issue was resolved.

I did want to note that when I run CuteSV force calling, the output VCF of the command has the chromosomes out of order again, below is the command I'm using for PacBio HiFi data:

cuteSV --max_cluster_bias_INS=1000 --diff_ratio_merging_INS=0.9 --max_cluster_bias_DEL=1000 --diff_ratio_merging_DEL=0.5 --genotype -Ivcf survivormerged_longread_sorted_updated.vcf sorted_Father_PB_mapped.bam hg38.fa FatherPB_CuteSV_forced.vcf /my/working/directory

Please note that the VCF file I feed CuteSV in that command has all chromosomes in order.

I do appreciate your time and your help, Kenneth

Meltpinkg commented 8 months ago

Hi @kchilkert

When the SVs on the small contigs like "chr6_KI270797v1_alt" are unnecessary, it is a good idea to filter them with bcftools. I'm glad to hear about the successfully resolved problem.

In addition, you mentioned that I retried after running the git clone command to no avail, I also tried uninstalling and reinstalling which made no difference as well. Is the installation through git clone & setup.py install failed? Or the newest version of cuteSV in GitHub still failed in detecting and through out the above KeyError?

For the issue of the order of cuteSV, cuteSV retains the position order of SVs the same as the input. As far as I am concerned, the output VCF of SURVIVOR is unordered. Maybe a sorting process of the Ivcf before force calling is necessary. I'd like to tell that I also think it may be more convenient to output sorted SVs instead of remaining its original order in Ivcf. I will fix it and update it soon.

Hope it will help.

Best, Shuqi

kchilkert commented 8 months ago

I ran the git clone & setup.py install command and it installed fine, no errors. When I subsequently tried to rerun cuteSV I received the following error:

Traceback (most recent call last):
  File "/home/k/kchilkert/.conda/envs/nf-env/bin/cuteSV", line 4, in <module>
    __import__('pkg_resources').run_script('cuteSV==2.1.0', 'cuteSV')
  File "/cm/shared/apps/miniconda3/4.9.2-py385/lib/python3.8/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/cm/shared/apps/miniconda3/4.9.2-py385/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1436, in run_script
    raise ResolutionError(
pkg_resources.ResolutionError: Script 'scripts/cuteSV' not found in metadata at '/home/k/kchilkert/.conda/envs/nf-env/lib/python3.8/site-packages/cuteSV-2.1.0.dist-info'

I then tried to uninstall and reinstall cuteSV, which installed fine. I tried to rerun again and received the same error I had before: KeyError: 'chr6_KI270797v1_alt'

-Kenneth

Meltpinkg commented 8 months ago

It may be better to create a new and clean environment to install cuteSV through git, like:

conda create -n cutesv_env python=3.10
conda activate cutesv_env
git clone https://github.com/tjiangHIT/cuteSV.git && cd cuteSV/ && python setup.py install

The source code on pypi and bioconda has not been updated, so installing via these platforms doesn't make changes.

Hope it will help.

Best, Shuqi

hehuaye commented 8 months ago

I face the same problem. Traceback (most recent call last): File "/data/dingsy/software/miniconda3/envs/cutesv_env/bin/cuteSV", line 4, in import('pkg_resources').run_script('cuteSV==2.1.0', 'cuteSV') File "/data/dingsy/software/miniconda3/envs/cutesv_env/lib/python3.11/site-packages/pkg_resources/init.py", line 720, in run_script self.require(requires)[0].run_script(script_name, ns) File "/data/dingsy/software/miniconda3/envs/cutesv_env/lib/python3.11/site-packages/pkg_resources/init.py", line 1559, in run_script exec(code, namespace, namespace)

Meltpinkg commented 8 months ago

Hi, @kchilkert @hehuaye

I have tested it on my environment and I found that in the site-package directory, when there is more than one package directory of the script, the system will mistake for the running script.

In detail, you can check the directory with the path like /home/{path of conda}/envs/{name of conda env}/lib/python3.8/site-packages/, whether there are two directories named cuteSV-2.1.0.dist-info and cuteSV-2.1.0-py3.8.egg. The latter directory is generated by setup install. Therefore, you can try to remove the former directory with the suffix "dist-info". Then cuteSV can search for the correct script.

In my test, when there are both cuteSV-2.1.0.dist-info and cuteSV-2.1.0-py3.8.egg directories in the site-packages directory, cuteSV will raise errors: Script 'scripts/cuteSV' not found in metadata. When I remove the former one by rm -r cuteSV-2.1.0.dist-info, cuteSV can run successfully. Hope it will help!

Best, Shuqi

hehuaye commented 8 months ago

Hi,@Meltpinkg I check my directory with the path /data/dingsy/miniconda3/envs/cutesv_env/lib/python3.10/site-packages,I only have one directory named cuteSV-2.1.0-py3.10.egg. I still can't solve my problem. The error is following: Traceback (most recent call last): File "/data/dingsy/miniconda3/envs/cutesv_env/bin/cuteSV", line 4, in import('pkg_resources').run_script('cuteSV==2.1.0', 'cuteSV') File "/data/dingsy/miniconda3/envs/cutesv_env/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script self.require(requires)[0].run_script(script_name, ns) File "/data/dingsy/miniconda3/envs/cutesv_env/lib/python3.10/site-packages/pkg_resources/init.py", line 1559, in run_script exec(code, namespace, namespace) File "/data/dingsy/miniconda3/envs/cutesv_env/lib/python3.10/site-packages/cuteSV-2.1.0-py3.10.egg/EGG-INFO/scripts/cuteSV", line 1266, in run(sys.argv[1:]) File "/data/dingsy/miniconda3/envs/cutesv_env/lib/python3.10/site-packages/cuteSV-2.1.0-py3.10.egg/EGG-INFO/scripts/cuteSV", line 1262, in run main_ctrl(args, argv) File "/data/dingsy/miniconda3/envs/cutesv_env/lib/python3.10/site-packages/cuteSV-2.1.0-py3.10.egg/EGG-INFO/scripts/cuteSV", line 992, in main_ctrl samfile = pysam.AlignmentFile(args.input) File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 997, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False

Meltpinkg commented 8 months ago

Hi @hehuaye

Sorry for my late reply. I missed the complete error messages. This error may come from the damaged BAM file making it unavailable to be opened with the pysam package. It is better to check the header of the BAM file together with the content.

Best, Shuqi

hehuaye commented 8 months ago

Hi, @Meltpinkg I've solved it. Thank you very much.