iqbal-lab-org / minos

Variant call adjudication
MIT License
16 stars 5 forks source link

How to build a manifest.tsv file. Can you provide a sample file? #116

Open joybio opened 2 years ago

joybio commented 2 years ago

comma or tab separated? how about pairwise sequencing? can I use compressed files, like gz or tar.gz?

here is an example of my manifest.tsv, is this ok?

######################################################################### name vcf reads reads 201550 NGS_Mapping/SNP/gvcf/201550.gvcf.gz 01.Cleandata/201550/201550.1.fq.clean.gz 01.Cleandata/201550/201550.2.fq.clean.gz 183207 NGS_Mapping/SNP/gvcf/183207.gvcf.gz 01.Cleandata/183207/183207.1.fq.clean.gz 01.Cleandata/183207/183207.2.fq.clean.gz #########################################################################

thanks.

martinghunt commented 2 years ago

It's a tab-delimited file, as described here: https://github.com/iqbal-lab-org/minos/wiki/Minos-for-joint-genotyping

joybio commented 2 years ago

thanks for the links! but I got another error as followed:

Command error: Traceback (most recent call last): File "/share/data3//miniconda3/envs/minos/bin/minos", line 8, in sys.exit(main()) File "/share/data3//miniconda3/envs/minos/lib/python3.9/site-packages/minos/main.py", line 300, in main args.func(args) File "/share/data3//miniconda3/envs/minos/lib/python3.9/site-packages/minos/tasks/vcf_cluster.py", line 6, in run tracker.cluster( File "/share/data3//miniconda3/envs/minos/lib/python3.9/site-packages/cluster_vcf_records/variant_tracking.py", line 672, in cluster split_ranges = self.find_parallel_cluster_split_points(max_ref_length, cpus) File "/share/data3/*/miniconda3/envs/minos/lib/python3.9/site-packages/cluster_vcf_records/variant_tracking.py", line 667, in find_parallel_cluster_split_points split_ranges.append([split_ranges[-1][-1] + 1, len(self.variants) - 1]) IndexError: list index out of range ############################################## my manifest.tsv:

name reads vcf 201550 /home/NGS/201550.dup.bam /home/NGS/201550.gvcf.gz ... ###############################################

martinghunt commented 2 years ago

Is it a lot of samples? If possible could you share the VCF files with me please and I can debug

joybio commented 2 years ago

it's about 1000 vcf files. the following text is the head of vcf files (all like this). #################################################################

description

...

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 180233

NZ_CP042314.1 1 . C . . END=87 GT:DP:GQ:MIN_DP:PL 0:202:99:58:0,1800 NZ_CP042314.1 88 . T C, 13506.04 . DP=356;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=1281600,356 GT:AD:DP:GQ:PL:SB NZ_CP042314.1 89 . C . . END=528 GT:DP:GQ:MIN_DP:PL 0:602:99:345:0,1800 NZ_CP042314.1 529 . A C, 28517.04 . DP=647;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=2329200,647 GT:AD:DP:GQ:PL:SB NZ_CP042314.1 530 . T . . END=543 GT:DP:GQ:MIN_DP:PL 0:630:99:624:0,1800 NZ_CP042314.1 544 . G A, 28195.04 . DP=645;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=2322000,645 GT:AD:DP:GQ:PL:SB NZ_CP042314.1 545 . C . . END=650 GT:DP:GQ:MIN_DP:PL 0:577:99:529:0,1800 NZ_CP042314.1 651 . A C, 21344.04 . DP=541;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=1947600,541 GT:AD:DP:GQ:PL:SB NZ_CP042314.1 652 . G . . END=1038 GT:DP:GQ:MIN_DP:PL 0:597:99:494:0,1800 NZ_CP042314.1 1039 . A G, 19547.04 . DP=512;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=1843200,512 GT:AD:DP:GQ:PL:SB NZ_CP042314.1 1040 . C . . END=1369 GT:DP:GQ:MIN_DP:PL 0:562:99:439:0,1800 NZ_CP042314.1 1370 . C A, 22301.04 . DP=580;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=2088000,580 GT:AD:DP:GQ:PL:SB NZ_CP042314.1 1371 . G . . END=2005 GT:DP:GQ:MIN_DP:PL 0:542:99:503:0,1800 ... 215663 lines ....

i'm reading this function (find_parallel_cluster_split_points) , what information are we processing in this function. Does the str "" affect?

thanks!

martinghunt commented 2 years ago

I've just tested and having <NON_REF> for an allele will break the pipeline later on (the gramtools_build stage). I expect bad things will also be happening internally before the actual failure.

There's also two nextflow parameters that may be causing your VCFs to be excluded. Their defaults made sense for TB, but maybe not for your data.

  1. --max_variants_per_sample. Default is 5000. If a VCF has more records than this, then it is excluded (on the basis that it must be too far away from the reference genome)
  2. --max_genome_proportion_per_sample. Default is 0.05. This is saying (approximately) that if the sample is more than 5% away from the reference, then it is excluded. The actual test is: (total length of REF alleles) / (genome size) > 0.05.

Sorry, these should be documented - is on my to do list.

joybio commented 2 years ago

thank you very much!

what should i do with "", can you add a if "sentence" to process this?

martinghunt commented 2 years ago

I'll add in a step that removes all non-ACGT alleles. You could wait for that fix, or remove the <NON_REF> alleles from your VCF files.

joybio commented 2 years ago

can i reset these two params? or just add these two params in the nextflow command lines?

I've just tested and having <NON_REF> for an allele will break the pipeline later on (the gramtools_build stage). I expect bad things will also be happening internally before the actual failure.

There's also two nextflow parameters that may be causing your VCFs to be excluded. Their defaults made sense for TB, but maybe not for your data.

  1. --max_variants_per_sample. Default is 5000. If a VCF has more records than this, then it is excluded (on the basis that it must be too far away from the reference genome)
  2. --max_genome_proportion_per_sample. Default is 0.05. This is saying (approximately) that if the sample is more than 5% away from the reference, then it is excluded. The actual test is: (total length of REF alleles) / (genome size) > 0.05.

Sorry, these should be documented - is on my to do list.

joybio commented 2 years ago

thanks, i'll wait for the update.

I'll add in a step that removes all non-ACGT alleles. You could wait for that fix, or remove the <NON_REF> alleles from your VCF files.

martinghunt commented 2 years ago

can i reset these two params? or just add these two params in the nextflow command lines?

Add them to the nextflow run command, like nextflow run --max_variants_per_sample 1000000 ...etc.

joybio commented 2 years ago

ok.

can i reset these two params? or just add these two params in the nextflow command lines?

Add them to the nextflow run command, like nextflow run --max_variants_per_sample 1000000 ...etc.

joybio commented 2 years ago

hi. i used 1 vcf file as input, replace("",".") in the vcf and add "--max_variants_per_sample 100000000 --max_genome_proportion_per_sample 1" to the nextflow run command, got the same error too.

martinghunt commented 2 years ago

can you share the vcf file?

joybio commented 2 years ago

#

description

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 201550

NZ_CP042314.1 1 . C . . . END=392 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0 NZ_CP042314.1 393 . T . . . END=399 GT:DP:GQ:MIN_DP:PL 0:4:99:4:0,135 NZ_CP042314.1 400 . C . . . END=403 GT:DP:GQ:MIN_DP:PL 0:4:90:4:0,90 NZ_CP042314.1 404 . G . . . END=404 GT:DP:GQ:MIN_DP:PL 0:4:45:4:0,45 NZ_CP042314.1 405 . A . . . END=1199 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0 NZ_CP042314.1 1200 . A . . . END=1209 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42 NZ_CP042314.1 1210 . C . . . END=1228 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1229 . G . . . END=1231 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 1232 . G . . . END=1233 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1234 . G . . . END=1237 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 1238 . G . . . END=1239 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1240 . C . . . END=1243 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 1244 . G . . . END=1244 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1245 . G . . . END=1245 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 1246 . G . . . END=1248 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1249 . T . . . END=1250 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 1251 . A . . . END=1253 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1254 . C . . . END=1255 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 1256 . C . . . END=1256 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45 NZ_CP042314.1 1257 . A . . . END=1257 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0 NZ_CP042314.1 7892 . C G,. 0.01 . BaseQRankSum=0.000;DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-0.674;RAW_MQandDP=5200,2;ReadPosRankSu m=0.674 GT:AD:DP:GQ:PL:SB 0:1,1,0:2:0:0,0,45:1,0,0,1 NZ_CP042314.1 7893 . C . . . END=7893 GT:DP:GQ:MIN_DP:PL 0:2:72:2:0,72 NZ_CP042314.1 7894 . G GT,. 0 . BaseQRankSum=0.000;DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-0.674;RAW_MQandDP=5200,2;ReadPosRankSu m=0.674 GT:AD:DP:GQ:PL:SB 0:1,1,0:2:0:0,0,45:1,0,0,1

############################################

there is only 1 difference: the character "" was transformed to "."

martinghunt commented 2 years ago

I think the cause is that all of the records are getting ignored either because there's no GT field, or because the GT calls are reference, not alt. I've added a description of which alleles in the input get used here: https://github.com/iqbal-lab-org/minos/wiki/Minos-in-single-sample-mode#input-vcf-files

I've also made a toy data set so you can test the installation. Instructions are here: https://github.com/iqbal-lab-org/minos/wiki/Minos-test-data

You'll need to update to the new release 0.12.4 to get the toy data, and that version will also handle the <xyz> style alleles.

joybio commented 2 years ago

error messages (test data): 1, gramtools build: error: the following arguments are required: -o/--gram_dir

i have figured it out. line 113 in nextflow/regenotype.nf. --gram-dir ==> --gram_dir --kmer-size ==> --kmer_size ############################################################# 2, [4f/75d978] process > parse_manifest [100%] 1 of 1 ✔ [c2/b9a58e] process > make_vcf_for_gramtools:vcf_merge [100%] 1 of 1 ✔ [41/e27f38] process > make_vcf_for_gramtools:vcf_cluster [100%] 1 of 1 ✔ [0d/109185] process > gramtools_build [100%] 1 of 1 ✔ [1d/1ce400] process > minos (1) [100%] 2 of 2, failed: 2 ✔ [- ] process > make_per_sample_vcfs_dir - [- ] process > ivcf_merge_chunks - [- ] process > ivcf_final_merge - [f1/634237] NOTE: Process minos (2) terminated with an error exit status (1) -- Error is ignored [1d/1ce400] NOTE: Process minos (1) terminated with an error exit status (1) -- Error is ignored

log

[minos 12-08-2022 17:47:26 INFO] Depencency check: vcfbreakmulti Unknown vcfbreakmulti [minos 12-08-2022 17:47:26 INFO] Depencency check: vcfallelicprimitives Unknown vcfallelicprimitives [minos 12-08-2022 17:47:26 INFO] Depencency check: vcfuniq Unknown vcfuniq [minos 12-08-2022 17:47:26 INFO] Depencency check: vt NOT_FOUND /share/data1/*/soft/miniconda3/envs/minos/bin/vt

I'm sure that vt is ok. what should I do?

joybio commented 2 years ago

another question: what should I do when GT calls are reference, not alt. ######################################################## my vcf: NZ_CP042314.1 1 . C . . . END=392 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0 NZ_CP042314.1 393 . T . . . END=399 GT:DP:GQ:MIN_DP:PL 0:4:99:4:0,135 your test: ref_name 100 . C A . . . GT 0/0 ref_name 101 . C . . . GT 1/1 ref_name 102 . T A,G . . . GT 0/1 ref_name 103 . G A,C, . . . GT 1/2

martinghunt commented 2 years ago

The vt line in the log looks strange: I wouldn't expect a * in that path. What does which vt return for you?

For ref calls - you'll need to change the GT to be allele that you want considered for genotyping. But also in your VCF the ALT column has . - this will need to be changed to a nucleotide sequence.