simonhmartin / genomics_general

General tools for genomic analyses.

ValueError: Sample NR112 at Scaffold_1:1 genotype ./././. does not match explected ploidy of 2 #90

Closed wangjie07070910 closed 1 year ago

wangjie07070910 commented 1 year ago

Thank you very much for these fantastic scripts. Some of the samples in my VCF file are tetraploid. Does the script not work with them? I tried the script for processing VCF files with this command:

python VCF_processing/parseVCF.py -i input.vcf.gz --skipIndels --minQual 30 --gtf flag=DP min=5 max=50 -o output.geno.gz

simonhmartin commented 1 year ago

Hi, you can specify the ploidy in a file (first column sample ID, second column ploidy). Add the option --ploidyFile ploidy_file.txt

If you get errors, please post the error here so I can help diagnose it.

Simon

wangjie07070910 commented 1 year ago

Hi Simon,

Many thanks for your help, but I still get an error:

parseVCF.py: error: argument --ploidy: invalid int value: 'ploidy_female.txt'

I presume that my ploidy.txt file is not in the right format. Its contents are as follows:

sample ID ploidy
Sample_1 2
Sample_2 4
Sample_3 4

Thanks again, Jie

simonhmartin commented 1 year ago

If all of your individuals are tetraploid, you can use --ploidy 4

If some of your individuals are diploid and some are tetraploid, use: --ploidyFile ploidy_file.txt
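For mixed ploidy, the file described above (first column sample ID, second column ploidy) could be generated along these lines. This is only a sketch: the helper name `write_ploidy_file` is hypothetical, and the tab separator and the absence of a header row are assumptions, not confirmed behaviour of parseVCF.py.

```python
import os
import tempfile


def write_ploidy_file(path, ploidy_by_sample):
    """Write one 'sample<TAB>ploidy' line per sample.

    Assumptions (not taken from parseVCF.py itself): tab-separated
    columns, no header row, no blank lines.
    """
    with open(path, "w") as handle:
        for sample, ploidy in ploidy_by_sample.items():
            handle.write(f"{sample}\t{int(ploidy)}\n")


# Sample names and ploidies as used in this thread.
ploidies = {"Sample_1": 2, "Sample_2": 4, "Sample_3": 4}

ploidy_path = os.path.join(tempfile.mkdtemp(), "ploidy_file.txt")
write_ploidy_file(ploidy_path, ploidies)

with open(ploidy_path) as handle:
    written = handle.read()
print(written, end="")
```

The resulting file would then be passed with --ploidyFile, whereas --ploidy takes a single integer that applies to every sample.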

wangjie07070910 commented 1 year ago

Thanks again.

I used --ploidyFile ploidy_file.txt, and the contents of my ploidy.txt file are as follows:

sample_ID ploidy
Sample_1 2
Sample_2 4
Sample_3 4

Then I got the error: ValueError: invalid literal for int() with base 10: 'ploidy'

I also tried the ploidy_file.txt file without the table header:

Sample_1 2
Sample_2 4
Sample_3 4

Then I got the error: IndexError: list index out of range

simonhmartin commented 1 year ago

Please check your ploidy file for empty lines. It sounds like the script is trying to read a line in the file that has no data in it.
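A quick way to look for such lines is a small check like the one below. It is a sketch only: `find_ploidy_file_problems` is a hypothetical helper, not part of genomics_general, and it assumes whitespace-separated columns. It flags empty lines, lines without a second column, and lines whose second column is not an integer (such as a header row).

```python
def find_ploidy_file_problems(lines):
    """Return (line_number, message) pairs for lines that do not look
    like 'sample_ID ploidy' with an integer ploidy."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            problems.append((number, "empty line"))
            continue
        if len(fields) < 2:
            problems.append((number, "missing ploidy column"))
            continue
        try:
            int(fields[1])
        except ValueError:
            problems.append((number, f"ploidy is not an integer: {fields[1]!r}"))
    return problems


# Example input: a header row at the top and a trailing blank line.
example = ["sample_ID ploidy\n", "Sample_1 2\n", "Sample_2 4\n", "Sample_3 4\n", "\n"]
problems = find_ploidy_file_problems(example)
for number, message in problems:
    print(f"line {number}: {message}")
```

For a real file, the same function can be called on `open("ploidy_file.txt").readlines()`.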

wangjie07070910 commented 1 year ago

Thanks again. When I changed 'ploidy' in the second column of the first row of my ploidy.txt file to a number, it worked:

sample_ID 2
Sample_1 2
Sample_2 4
Sample_3 4

I don't know whether this change has any effect on the results. Besides, I have a new problem:

Error:Sample Sample_2 at Scaffold_1:1 genotype ./././. does not match explected ploidy of 2 (appears when I set Sample_2 to 2x)
Error:Sample Sample_2 at Scaffold_2:25 genotype ./. does not match explected ploidy of 4 (appears when I set Sample_2 to 4x)

I know this is probably a problem with my sample (it is supposed to be tetraploid), but I am posting it here and would appreciate your opinion. How should I preprocess a sample like this?

simonhmartin commented 1 year ago

Yes, this is a problem with your VCF, which includes incorrect formatting for some sites. You can add the option --ploidyMismatchToMissing to set these sites to missing data. In general, please remember that you can type parseVCF.py -h to see all the available options.
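To see why both settings fail for Sample_2, it can help to count the alleles in each genotype string. The sketch below is illustrative and is not the code used in parseVCF.py; the function names are hypothetical. It relies only on the VCF convention that alleles in the GT field are separated by / (unphased) or | (phased).

```python
import re


def called_ploidy(genotype):
    """Count the alleles in a VCF GT string such as '0/1' or './././.'."""
    return len(re.split(r"[/|]", genotype))


def ploidy_matches(genotype, expected_ploidy):
    """True if the GT string has exactly the expected number of alleles."""
    return called_ploidy(genotype) == expected_ploidy


# The two genotypes reported for Sample_2 in this thread.
tetraploid_site = "./././."   # Scaffold_1:1
diploid_site = "./."          # Scaffold_2:25

print(called_ploidy(tetraploid_site), called_ploidy(diploid_site))
```

Because the same sample is written with four alleles at one site and two at another, no single ploidy value satisfies every site, which is the situation --ploidyMismatchToMissing is meant to handle.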