Open thkuo opened 6 years ago
[vcf2indel.py]
The current binary labels are ambiguous and can result in incorrect ML outcomes.
The 1s may mean insertions, deletions, missing values, or mixture of indels. For a certain indel feature, furthermore, is 1 1 1 0 0
equal to 0 0 0 1 1
? Are two features that have same values redundant?
Key questions to answer:
In the four filters of vcf2indel.py,
not existing
if ref == ".": ...
majority-based assignment
elif alt == ".":...
mixture of majority-based and reference-based methods
elif len(sample_names.loc[samples == "."]) > sample_names.shape[0]/2:...
reference-based
else:...
Only update the connection with snakemake main script (for now)
Even with the original methods, the indels were incorrectly detected in this case:
>CH2500
>F1659
>CH2502
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>CH2522
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGTGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>ESP088
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>MHH15083
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
and the vcf outcome was:
##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=561>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CH2500 CH2502 CH2522 ESP088 F1659 MHH15083
chrUn 0 . ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTGGTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTCTCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGCAGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTGGTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCTACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCGGCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAGGACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCGGTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGCCACCTGCGCCTGCAGGACTGA ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTGGTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTCTCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGCAGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTGGTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGTGCCGCTACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCGGCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAGGACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCGGTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGCCACCTGCGCCTGCAGGACTGA . . DP=4 GT:DP ./. 0/0:1 1/1:1 0/0:1 ./. 0/0:1
The gene was lost in two strains, but it wasn't detected by the old method. It could be listed in the gpa table though...
Problems to solve:
Strategy: