stat-lab / MOPline

Detection and genotyping of structural variants
MIT License

No match with chr1 between the reference index file and the input vcf file #11

Open jingydz opened 5 months ago

jingydz commented 5 months ago

I followed the MOPline structure shown in the figure.

Why did I get the message "No match with chr1 between the reference index file and the input vcf file", and why did my "sample.Merge.ALL.vcf" file become empty?

stat-lab commented 5 months ago

At which step did the error occur? Please show your command.

jingydz commented 5 months ago

```
# first step
mopline merge_7tools --sample samplename -rl 150

# second step
mopline create_cov \
  -b /xxx/samplename/samplename.bam \
  -r /xxx/hg38.fa \
  -rl 150 -n 4

# third step
mopline add_cov --sample_list samplename.list --toolset 7tools --vcf_dname Merge_7tools -n 2
```

jingydz commented 5 months ago

```
less samplename/Manta/Manta.samplename.vcf
```

(four `ALT=` header lines followed here; their angle-bracketed contents were stripped by the issue renderer)

The header above is from the VCF result file of the sample processed by Manta. After running Manta, I also used the convertInversion.py script to convert some BNDs into INVs. Is this the correct input for MOPline?

stat-lab commented 5 months ago

I suppose your data is human data. By default, the human reference build is GRCh37 (specify --build 38 if you use build 38). Are the chromosome names 1, 2, ... X in your vcf file?

jingydz commented 5 months ago

Yes, the chromosome names in my vcf file are chr1, chr2, chr3, ...

Thank you.

I ran the command "mopline add_cov --sample_list samplename.list --toolset 7tools --build 38 --vcf_dname Merge_7tools -n 2", and it appears that my Wham file was not recognized.

jingydz commented 5 months ago

Is it that the VCF file input for Wham does not contain genotypes?

stat-lab commented 5 months ago

Your reference genome seems to be GRCh38 or hg19. If hg19 is used, remove the 'chr' prefix from the chromosome names in all your data, including vcf, fasta, and bam. If you use the GRCh38 reference, add the --build 38 option to your command.
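For the hg19 case, one generic way to strip the 'chr' prefix from a VCF is a sed pass over both the contig header lines and the data lines. This is a sketch, not a MOPline command: the file names are examples, and BAM files would additionally need their headers rewritten (e.g. with samtools reheader).

```shell
# Demo input: one contig header line and one data line with a 'chr' prefix.
printf '##contig=<ID=chr1,length=248956422>\nchr1\t100\t.\tA\tT\t.\tPASS\t.\n' > demo.vcf

# Strip the prefix from data lines (^chr) and from contig header IDs.
sed -e 's/^chr//' -e 's/^##contig=<ID=chr/##contig=<ID=/' demo.vcf > demo.nochr.vcf
cat demo.nochr.vcf
```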

stat-lab commented 5 months ago

Your Wham vcf contains genotype information.

To solve your previous error, the index file of the reference fasta you used can be specified with the --ref_index option.

jingydz commented 5 months ago

“Your Wham vcf contains genotype information.” The results from running Wham do not include genotypes; I noticed that my data lacks genotypes in the same way as your example. By adding the --build 38 option, I was able to run mopline add_cov successfully and obtained the updated result file samplename.Merge.ALL.vcf.

```
##fileformat=VCFv4.0
##fileDate=20240126
##source=MOPline-1.7
##reference=hs38
```

(the INFO, ALT, FORMAT, and contig header lines followed here; their angle-bracketed contents were stripped by the issue renderer)

```
#CHROM  POS       ID  REF  ALT    QUAL  FILTER  INFO                                                                                                   FORMAT                samplename
chr1    14109813  .   .    <DEL>  .     PASS    SVTYPE=DEL;SVLEN=-1933;END=14111745;READS=3;TOOLS=CNVnator:14110501:2000:3,Manta:14109813:1933:3       GT:GQ:VP:VL:DR:DS:SR  ./.:0:14109813:1933:1.14:0.21:10
```
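As an aside, the INFO column of a merged record like the one above can be unpacked with generic VCF parsing in awk. This is a sketch: the record is rebuilt inline to mirror the posted line, and the key names come straight from its INFO field.

```shell
# Build a record like the merged DEL call above (tab-separated VCF columns).
rec=$(printf 'chr1\t14109813\t.\t.\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-1933;END=14111745')

# Split the INFO column (field 8) into key=value pairs and print a few keys.
echo "$rec" | awk -F'\t' '{
  n = split($8, kv, ";")
  for (i = 1; i <= n; i++) { split(kv[i], p, "="); info[p[1]] = p[2] }
  print info["SVTYPE"], info["SVLEN"], info["END"]
}'
# prints: DEL -1933 14111745
```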

jingydz commented 5 months ago

Hi, I have another question. After running CNVnator, I got the cnv file and then a vcf file via cnvnator2VCF.pl. For example, there are 7160 records in sample.cnv, and the same number of variant sites are present in the VCF file. But after using MOPline/scripts/run_SVcallers/convert_CNVnator_vcf.pl, there are only 2645 records in cnvnator.sample.vcf.

During this conversion, have too many variant sites been filtered out?

stat-lab commented 5 months ago

The convert_CNVnator_vcf.pl script filters variants with ambiguous read depth differences in the CNVnator output. The details of the filtering are described in the Supplementary Methods of our previous paper (Kosugi et al., Genome Biol 2019).

jingydz commented 5 months ago

Hi, where can I find a detailed explanation of the following file? For example, does the second column represent positions with a step size of 50 bp, and what do the other columns represent?

```
$ less sample/Cov/sample.chr1.cov.gz | head -n 3
chr1  10000  19.3  4.6   9  0
chr1  10050  39.2  13.7  1  0
chr1  10100  21.2  12.5  1  30
```

stat-lab commented 5 months ago

Each column represents chr, pos, cov, cov2, split5, and split3:

- cov: read coverage of the 50-bp region
- cov2: read coverage of reads with mapping quality > 0 (by default)
- split5: the number of 5'-clipped ends of aligned reads in the 50-bp region
- split3: the number of 3'-clipped ends of aligned reads in the 50-bp region
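With that layout, simple per-column summaries fall out of awk. For example, averaging the cov column over the three sample windows quoted above (a sketch using those quoted lines, not a MOPline utility):

```shell
# Mean of column 3 (cov, read coverage per 50-bp window) over the lines above.
printf 'chr1 10000 19.3 4.6 9 0\nchr1 10050 39.2 13.7 1 0\nchr1 10100 21.2 12.5 1 30\n' |
  awk '{ sum += $3; n++ } END { printf "%.1f\n", sum / n }'
# prints: 26.6
```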

jingydz commented 5 months ago

Thank you so much for your quick response!

I have another question. When I executed create_cov, I had a question about the runtime in your table: why does the runtime increase when the number of cores increases?

When I run it on my data, the result differs from yours. My bam file is 44 GB. With one core, it took 1 h 50 min the first time and 1 h 42 min the second time; with two cores, 54 min; with three cores, 41 min; with four cores, 33 min; with five cores, 29 min.

This is just the time required for a single sample. When I use mopline's create_cov for multiple samples simultaneously, the time for each sample increases exponentially. Can you help me explain why this is happening?

Additionally, I noticed that a parameter for jointcall in your README seems to be incorrect: it should be --md rather than --id.

stat-lab commented 5 months ago

create_cov with multiple threads processes the genome per chromosome, so specifying a thread number greater than the number of chromosomes brings no benefit. For multiple samples, I recommend running several jobs, each on a subset of the samples, rather than one job sequentially over all samples.

The -id option for joint call was incorrect, but this has been fixed in the latest README on GitHub.

jingydz commented 5 months ago

So, when I run add_cov, can I also split the entire sample list into several smaller subsets?

For example:

```
mopline add_cov --sample_list sample.list.1 --toolset 7tools --build 38 --vcf_dname Merge_7tools -n 4
mopline add_cov --sample_list sample.list.2 --toolset 7tools --build 38 --vcf_dname Merge_7tools -n 4
mopline add_cov --sample_list sample.list.3 --toolset 7tools --build 38 --vcf_dname Merge_7tools -n 4
...
```

stat-lab commented 5 months ago

Yes. If you have a job management system such as LSF or Slurm, you can use the scripts run_batch_LSF.pl, run_batch_slurm.pl, and so on. If not, you can process the jobs in batches (e.g., nohup mopline add_cov --sample_list sample.list.1 --toolset 7tools --build 38 --vcf_dname Merge_7tools -n 4 &).
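Without a scheduler, the batching step can be sketched like this: split the sample list into fixed-size chunks and emit one add_cov command per chunk. The file names and the chunk size of 2 are illustrative; the commands are echoed rather than executed, so you can pipe them to sh or adapt them for your scheduler.

```shell
# Demo sample list; a real list would hold many sample names.
printf '%s\n' s01 s02 s03 s04 > sample.list

# Split into chunks of 2 samples with numeric suffixes: chunk.00, chunk.01, ...
split -l 2 -d sample.list chunk.

# Print one backgrounded add_cov command per chunk (echoed, not executed).
for f in chunk.0*; do
  echo "nohup mopline add_cov --sample_list $f --toolset 7tools --build 38 --vcf_dname Merge_7tools -n 4 &"
done
```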

jingydz commented 3 months ago

Hello, when selecting the input callers for MOPline, I only had results from Manta, CNVnator, and Wham. Given that MOPline requires inputs from seven tools, I used only these three and, for various reasons, provided empty files for the remaining four. Would this still be acceptable?