hallamlab / MetaPathways

A modular pipeline for constructing Pathway/Genome Databases from environmental sequence information
http://hallam.microbiology.ubc.ca/MetaPathways

Preprocessing alters sequences #32

Closed xapple closed 11 years ago

xapple commented 11 years ago

I finished installing the pipeline and started looking at the outputs it produces. There already seems to be a problem at the first preprocessing step: every sequence has one nucleotide removed from its start. Is there any reason for this?

Here is the head of the regression_test.fasta file as it appears in the test data directory:

>RNA_LSU1
CTAATTCATGGCTAAGTGGGAAAGCAGGTGATACGACCAAAACAACCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAACAGCTCACTGGTCTAGATAAGTTGTGTTGCGGCGAAGATGTAACGGGGCTCAAGCCATGAGCCGAAGCTGAGGACACTGTAAAGTGTGGTAGCGGAGCGTTCCGTGATATAAATTCATGCGACTTTCAGTGCACTTCCAAGTGCGTTGATGTCGAATGAGTTTTTCTGTGAAGCCGGGCTGTAAGGCATCCGGTGGAGAGATCGGAAGTGAGAATGTTGACATGAGTAGCGATAAACAGGGTGAGAGACCCTGTCGCCGGAAGTCCAAGGGTTCCTGCTTAAAGCTAATCTGAGCAGGGTAAGCCGACCCCTAAGGCGAGGCCGAAAGGCGTAGTCGATGGGAACCAGGTTAATATTCCTGGGCCAGGAGAGTGTGACGGATCACGAGTGTAGTAAGGTCTTATTGGATTGATCTTGCTGCTTAGTGGTTCCTGGAAATAGCCCTCCATTATGATCGTACCCTAAACCGACACAGGTGGACAGGTAGAGTATACCAAGGCGCTTGAGAGAACTGTATTGAAGGAACTCGGCAAAATACCTCCGTAAGTTCGCGAGAAGGAGGCCCTATATTTAGGCAACTAGATGTAGGGGACACAACCCAGGGGGTGGCGACTGTTTACTAAAAACACAGGGCTCTGCGAAGCCGCAAGGCGATGTATAGGGTCTGACGCCTGCCCCGTGCCGGAAGGTTAAGAGGAGGTGTGCAAGCCCCCGAATTGAAGCCCCCGGGAAACGGCGGCCGGAACTATAACGGTCCTAAGGGAGCGAAATTCTTGGCCGGGAAGTTCCCACCTGCACAAAGGGGGTAA
>RNA_LSU2
CGACCTTGAAATAGTTGGGTGGCTACGGTGAGGGACNAACCTAGGCCCATTATCTTTGTCGGGGACAGAGTGTGGNGGGTAGTTTGACTGGGGCGGACTCCTCCCAAAGAGTAACGGAGGAGCACGAATGTGCGCTAATCATGGTCGGAAATCATGAGGTTAGTGTAAAGGCACAAGCGCGCTTGACTGTGAGTCTGACAAGACGAACAGGTACGAAAGTAGGTCTTAGTGATCCGGTGGTTCTGTATGGAAGGGCCATCGCTCAACGGATAAAAGGTACTCCGGGGATAACAGGCTGATACCGCCCAAGAGTTCACATCGACGGCGGTGTTTGGCACCTCGATGTCGGCTCATCACATCCTGGGGCTGAAGCCGGTCCCAAGGGTATGGCTGTTCGCCATTTAAAGTGGTACGCGAGCTGGGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGGCGTTGGAGAATTGAGGGAAGCTGCTCCTAGTACGAGAGGACCGGAGTGGACGAACCTCTGGTGTTCCGGTTGTGTCGCCAGACGCATTGCCGGGTAGCTACGTTCGGACAGGATAACCGCTGAAAGCATCTAAGCGGGAAGCCTCTCCCAAGACTAGTTCTCCCTAACCTCTTGAGGGTTCTTAAGAGCCGTCCAAGACTATGACGTTGATAGGCTGGATGTGGAAGCGTTGTCAGGCGTTGAGCTAACCACTACTAATTGCTCGTGCGGCTTGGACATATAATCCCAGTTCGTTAGGTGTGCTAAGCTCCCGCAGGAACACTCTGGGTGTTGTGACCAGTCCTTGGGAGGAGCAGGGTTTAATCGCCGGCAAAGGATTCCGGTTTTTTTTACCCGACCCCCTGTAGAAGGCCCTTGGAGCTGCCCGCAAACCACCTGTCATCTAAATCGAATTTCCCGGGCGACAACAGAGCCTTTGGAACCCCCCGGA
>RNA_LSU3
CCGGGAGGGATAACAGGGCTGACTACTGCACCAACATCGTTGCATTCAACTAAATTCCCTGCGGCATTGACTCCACTCTGCATGACGACTACGTCTCCGAAGCTGGTTAGTTCATCTAGTTCATCCTCCGTAGCTATATCAGCAACAGCGATTCCCTTTCTATTCCCCTTCAGCAAGCTTCCTATCAATGAGGACCCCCCTACCATGAATGGGTGCATTTCCAATTCAAATGTGGATTCTAGCTCATGAATAGCTTCGTTACTGAGAGAATGGGGGTGAAACAAAATGTCACCAACTACCGAGAGGTAAACGCCCACTTGACTGTGGCCCAGAATGTCTCCTGTAGAGATTCCCATAATGATCCCTCAAATTCCTGACCACTCTAGGACTTGTGATTGTTACTGATGTTTTGCGGGGCCCAGGGCGCCTCGCTGGGCGAGACGCCCTGGGGTGAAAGAAGAAGGCACAGCGACGAACGTTCCCGTGGGCTCTCGCACCACAGTACAGGAATCGACGCAGATGGGCTTGACTTCCGGGTTCGAAATGGGACCGGGTATTTCCCCATCGCTATGGCCGTGCACAACTTTCTTTCGAATGTGAATTGATATGCTTGATTGAACAGAGGCCCATACAGGGCTTAGTATGTGAATTCTTGATGCTCGGGGGATTAGTGCAGCTGGGCTGAACACCTCGGTCGAAACCTCGGTGCTTACACCCGCTGTCTATCAATCTCGTCTTTTACGAGTCCCCTGAGGTGCCTCGTCTTTCGGGCGACTTCGAGCTTAGATGCTTTCAGCTCTTATCCGCATGGAGCGTGGCTACCCAGCATTGCCTTGTCAGACAACTGGTAAACTAGTGGCTCCGAGCCCTCGTTCCTCTCGTACTAAAGGGCCCTTCCGATCAGGCACCTGATACGGCTCAACCACAAAGCACAAACCTGTCTCACGACGGTCTAAACCCATCTCAGATCCCCTTTAAATGGG

Here is the head of the regression_test.fasta file as it appears in the preprocessed directory:

>regression_test_0
TAATTCATGGCTAAGTGGGAAAGCAGGTGATACGACCAAAACAACCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAACAGCTCACTGGTCTAGATAAGTTGTGTTGCGGCGAAGATGTAACGGGGCTCAAGCCATGAGCCGAAGCTGAGGACACTGTAAAGTGTGGTAGCGGAGCGTTCCGTGATATAAATTCATGCGACTTTCAGTGCACTTCCAAGTGCGTTGATGTCGAATGAGTTTTTCTGTGAAGCCGGGCTGTAAGGCATCCGGTGGAGAGATCGGAAGTGAGAATGTTGACATGAGTAGCGATAAACAGGGTGAGAGACCCTGTCGCCGGAAGTCCAAGGGTTCCTGCTTAAAGCTAATCTGAGCAGGGTAAGCCGACCCCTAAGGCGAGGCCGAAAGGCGTAGTCGATGGGAACCAGGTTAATATTCCTGGGCCAGGAGAGTGTGACGGATCACGAGTGTAGTAAGGTCTTATTGGATTGATCTTGCTGCTTAGTGGTTCCTGGAAATAGCCCTCCATTATGATCGTACCCTAAACCGACACAGGTGGACAGGTAGAGTATACCAAGGCGCTTGAGAGAACTGTATTGAAGGAACTCGGCAAAATACCTCCGTAAGTTCGCGAGAAGGAGGCCCTATATTTAGGCAACTAGATGTAGGGGACACAACCCAGGGGGTGGCGACTGTTTACTAAAAACACAGGGCTCTGCGAAGCCGCAAGGCGATGTATAGGGTCTGACGCCTGCCCCGTGCCGGAAGGTTAAGAGGAGGTGTGCAAGCCCCCGAATTGAAGCCCCCGGGAAACGGCGGCCGGAACTATAACGGTCCTAAGGGAGCGAAATTCTTGGCCGGGAAGTTCCCACCTGCACAAAGGGGGTAA
>regression_test_1
GACCTTGAAATAGTTGGGTGGCTACGGTGAGGGACNAACCTAGGCCCATTATCTTTGTCGGGGACAGAGTGTGGNGGGTAGTTTGACTGGGGCGGACTCCTCCCAAAGAGTAACGGAGGAGCACGAATGTGCGCTAATCATGGTCGGAAATCATGAGGTTAGTGTAAAGGCACAAGCGCGCTTGACTGTGAGTCTGACAAGACGAACAGGTACGAAAGTAGGTCTTAGTGATCCGGTGGTTCTGTATGGAAGGGCCATCGCTCAACGGATAAAAGGTACTCCGGGGATAACAGGCTGATACCGCCCAAGAGTTCACATCGACGGCGGTGTTTGGCACCTCGATGTCGGCTCATCACATCCTGGGGCTGAAGCCGGTCCCAAGGGTATGGCTGTTCGCCATTTAAAGTGGTACGCGAGCTGGGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGGCGTTGGAGAATTGAGGGAAGCTGCTCCTAGTACGAGAGGACCGGAGTGGACGAACCTCTGGTGTTCCGGTTGTGTCGCCAGACGCATTGCCGGGTAGCTACGTTCGGACAGGATAACCGCTGAAAGCATCTAAGCGGGAAGCCTCTCCCAAGACTAGTTCTCCCTAACCTCTTGAGGGTTCTTAAGAGCCGTCCAAGACTATGACGTTGATAGGCTGGATGTGGAAGCGTTGTCAGGCGTTGAGCTAACCACTACTAATTGCTCGTGCGGCTTGGACATATAATCCCAGTTCGTTAGGTGTGCTAAGCTCCCGCAGGAACACTCTGGGTGTTGTGACCAGTCCTTGGGAGGAGCAGGGTTTAATCGCCGGCAAAGGATTCCGGTTTTTTTTACCCGACCCCCTGTAGAAGGCCCTTGGAGCTGCCCGCAAACCACCTGTCATCTAAATCGAATTTCCCGGGCGACAACAGAGCCTTTGGAACCCCCCGGA
>regression_test_2
CGGGAGGGATAACAGGGCTGACTACTGCACCAACATCGTTGCATTCAACTAAATTCCCTGCGGCATTGACTCCACTCTGCATGACGACTACGTCTCCGAAGCTGGTTAGTTCATCTAGTTCATCCTCCGTAGCTATATCAGCAACAGCGATTCCCTTTCTATTCCCCTTCAGCAAGCTTCCTATCAATGAGGACCCCCCTACCATGAATGGGTGCATTTCCAATTCAAATGTGGATTCTAGCTCATGAATAGCTTCGTTACTGAGAGAATGGGGGTGAAACAAAATGTCACCAACTACCGAGAGGTAAACGCCCACTTGACTGTGGCCCAGAATGTCTCCTGTAGAGATTCCCATAATGATCCCTCAAATTCCTGACCACTCTAGGACTTGTGATTGTTACTGATGTTTTGCGGGGCCCAGGGCGCCTCGCTGGGCGAGACGCCCTGGGGTGAAAGAAGAAGGCACAGCGACGAACGTTCCCGTGGGCTCTCGCACCACAGTACAGGAATCGACGCAGATGGGCTTGACTTCCGGGTTCGAAATGGGACCGGGTATTTCCCCATCGCTATGGCCGTGCACAACTTTCTTTCGAATGTGAATTGATATGCTTGATTGAACAGAGGCCCATACAGGGCTTAGTATGTGAATTCTTGATGCTCGGGGGATTAGTGCAGCTGGGCTGAACACCTCGGTCGAAACCTCGGTGCTTACACCCGCTGTCTATCAATCTCGTCTTTTACGAGTCCCCTGAGGTGCCTCGTCTTTCGGGCGACTTCGAGCTTAGATGCTTTCAGCTCTTATCCGCATGGAGCGTGGCTACCCAGCATTGCCTTGTCAGACAACTGGTAAACTAGTGGCTCCGAGCCCTCGTTCCTCTCGTACTAAAGGGCCCTTCCGATCAGGCACCTGATACGGCTCAACCACAAAGCACAAACCTGTCTCACGACGGTCTAAACCCATCTCAGATCCCCTTTAAATGGG
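
A comparison like the one above can also be scripted rather than done by eye. The following is only a minimal sketch: `read_fasta`, `locate` and `compare` are hypothetical helpers written for this illustration (they are not part of MetaPathways), and the sketch assumes that the preprocessed records appear in the same order as the originals.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of FASTA lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def locate(original, processed):
    """Return (start, end) of `processed` inside `original`, or None if absent."""
    start = original.find(processed)
    if start == -1:
        return None
    return start, start + len(processed)


def compare(original_lines, processed_lines):
    """Pair records by position (an assumption) and report where each
    processed sequence sits inside its original sequence."""
    report = []
    pairs = zip(read_fasta(original_lines), read_fasta(processed_lines))
    for (name, seq), (new_name, new_seq) in pairs:
        report.append((name, new_name, len(seq), locate(seq, new_seq)))
    return report
```

For each record, a start offset of 1 together with an end equal to the original length would mean that only the first base was dropped, while `None` would mean the sequence was changed in some other way.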

nielshanson commented 11 years ago

Indeed, I can think of no good reason to remove the first base. Probably some index in the parser is off by one. I will investigate today and fix this small bug.

xapple commented 11 years ago

Well, that appears to resolve the issue of the first base being removed, but now that we run it again, the script behaves quite differently! Sequences are now truncated at every N character, which they were not before. Look at regression_test_1 in particular:

>regression_test_0
CTAATTCATGGCTAAGTGGGAAAGCAGGTGATACGACCAAAACAACCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAACAGCTCACTGGTCTAGATAAGTTGTGTTGCGGCGAAGATGTAACGGGGCTCAAGCCATGAGCCGAAGCTGAGGACACTGTAAAGTGTGGTAGCGGAGCGTTCCGTGATATAAATTCATGCGACTTTCAGTGCACTTCCAAGTGCGTTGATGTCGAATGAGTTTTTCTGTGAAGCCGGGCTGTAAGGCATCCGGTGGAGAGATCGGAAGTGAGAATGTTGACATGAGTAGCGATAAACAGGGTGAGAGACCCTGTCGCCGGAAGTCCAAGGGTTCCTGCTTAAAGCTAATCTGAGCAGGGTAAGCCGACCCCTAAGGCGAGGCCGAAAGGCGTAGTCGATGGGAACCAGGTTAATATTCCTGGGCCAGGAGAGTGTGACGGATCACGAGTGTAGTAAGGTCTTATTGGATTGATCTTGCTGCTTAGTGGTTCCTGGAAATAGCCCTCCATTATGATCGTACCCTAAACCGACACAGGTGGACAGGTAGAGTATACCAAGGCGCTTGAGAGAACTGTATTGAAGGAACTCGGCAAAATACCTCCGTAAGTTCGCGAGAAGGAGGCCCTATATTTAGGCAACTAGATGTAGGGGACACAACCCAGGGGGTGGCGACTGTTTACTAAAAACACAGGGCTCTGCGAAGCCGCAAGGCGATGTATAGGGTCTGACGCCTGCCCCGTGCCGGAAGGTTAAGAGGAGGTGTGCAAGCCCCCGAATTGAAGCCCCCGGGAAACGGCGGCCGGAACTATAACGGTCCTAAGGGAGCGAAATTCTTGGCCGGGAAGTTCCCACCTGCACAAAGGGGGTAA
>regression_test_1
GGGTAGTTTGACTGGGGCGGACTCCTCCCAAAGAGTAACGGAGGAGCACGAATGTGCGCTAATCATGGTCGGAAATCATGAGGTTAGTGTAAAGGCACAAGCGCGCTTGACTGTGAGTCTGACAAGACGAACAGGTACGAAAGTAGGTCTTAGTGATCCGGTGGTTCTGTATGGAAGGGCCATCGCTCAACGGATAAAAGGTACTCCGGGGATAACAGGCTGATACCGCCCAAGAGTTCACATCGACGGCGGTGTTTGGCACCTCGATGTCGGCTCATCACATCCTGGGGCTGAAGCCGGTCCCAAGGGTATGGCTGTTCGCCATTTAAAGTGGTACGCGAGCTGGGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGGCGTTGGAGAATTGAGGGAAGCTGCTCCTAGTACGAGAGGACCGGAGTGGACGAACCTCTGGTGTTCCGGTTGTGTCGCCAGACGCATTGCCGGGTAGCTACGTTCGGACAGGATAACCGCTGAAAGCATCTAAGCGGGAAGCCTCTCCCAAGACTAGTTCTCCCTAACCTCTTGAGGGTTCTTAAGAGCCGTCCAAGACTATGACGTTGATAGGCTGGATGTGGAAGCGTTGTCAGGCGTTGAGCTAACCACTACTAATTGCTCGTGCGGCTTGGACATATAATCCCAGTTCGTTAGGTGTGCTAAGCTCCCGCAGGAACACTCTGGGTGTTGTGACCAGTCCTTGGGAGGAGCAGGGTTTAATCGCCGGCAAAGGATTCCGGTTTTTTTTACCCGACCCCCTGTAGAAGGCCCTTGGAGCTGCCCGCAAACCACCTGTCATCTAAATCGAATTTCCCGGGCGACAACAGAGCCTTTGGAACCCCCCGGA
>regression_test_2
CCGGGAGGGATAACAGGGCTGACTACTGCACCAACATCGTTGCATTCAACTAAATTCCCTGCGGCATTGACTCCACTCTGCATGACGACTACGTCTCCGAAGCTGGTTAGTTCATCTAGTTCATCCTCCGTAGCTATATCAGCAACAGCGATTCCCTTTCTATTCCCCTTCAGCAAGCTTCCTATCAATGAGGACCCCCCTACCATGAATGGGTGCATTTCCAATTCAAATGTGGATTCTAGCTCATGAATAGCTTCGTTACTGAGAGAATGGGGGTGAAACAAAATGTCACCAACTACCGAGAGGTAAACGCCCACTTGACTGTGGCCCAGAATGTCTCCTGTAGAGATTCCCATAATGATCCCTCAAATTCCTGACCACTCTAGGACTTGTGATTGTTACTGATGTTTTGCGGGGCCCAGGGCGCCTCGCTGGGCGAGACGCCCTGGGGTGAAAGAAGAAGGCACAGCGACGAACGTTCCCGTGGGCTCTCGCACCACAGTACAGGAATCGACGCAGATGGGCTTGACTTCCGGGTTCGAAATGGGACCGGGTATTTCCCCATCGCTATGGCCGTGCACAACTTTCTTTCGAATGTGAATTGATATGCTTGATTGAACAGAGGCCCATACAGGGCTTAGTATGTGAATTCTTGATGCTCGGGGGATTAGTGCAGCTGGGCTGAACACCTCGGTCGAAACCTCGGTGCTTACACCCGCTGTCTATCAATCTCGTCTTTTACGAGTCCCCTGAGGTGCCTCGTCTTTCGGGCGACTTCGAGCTTAGATGCTTTCAGCTCTTATCCGCATGGAGCGTGGCTACCCAGCATTGCCTTGTCAGACAACTGGTAAACTAGTGGCTCCGAGCCCTCGTTCCTCTCGTACTAAAGGGCCCTTCCGATCAGGCACCTGATACGGCTCAACCACAAAGCACAAACCTGTCTCACGACGGTCTAAACCCATCTCAGATCCCCTTTAAATGGG

Should ambiguous bases be included in the input to the further steps, or is this not recommended?

nielshanson commented 11 years ago

Hey Lucas,

Yeah, this is our intended behaviour for dealing with ambiguous bases, though I'm not quite sure it is the best way to go. The idea is that these tend to be long stretches of ambiguous bases, which are unlikely to help you predict an ORF. However, I also think prodigal can accept ambiguous bases if they are changed to capital 'N's, but I'd have to double-check. We're not sure what its exact behaviour is in this case.
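
As an illustration of the splitting described here, the sketch below breaks a sequence at every run of characters other than A, T, C, G and keeps one fragment. It is not the code from MetaPathways_filter_input.py: `longest_unambiguous_fragment` is a hypothetical name, and keeping the longest fragment is an assumption that merely agrees with the regression_test_1 output shown in this thread.

```python
import re

# One or more characters that are not an unambiguous base (either case).
AMBIGUOUS = re.compile(r"[^atcgATCG]+")


def longest_unambiguous_fragment(sequence):
    """Split the sequence at ambiguous bases and return the longest piece.

    Assumption for illustration only: the real script may select its
    fragment differently. On ties, the first longest piece is returned.
    """
    fragments = AMBIGUOUS.split(sequence.strip())
    return max(fragments, key=len)
```

With an input such as `CGACNAACCTGGNGGGTAGTTTG`, the pieces are `CGAC`, `AACCTGG` and `GGGTAGTTTG`, so the last one is kept, which mirrors how regression_test_1 now starts at `GGGTAGTTTG`.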

If you want to prevent this kind of splitting from happening you can comment out lines 85 and 86 of MetaPathways_filter_input.py and add:

    sequence = re.sub(r'[^atcgATCG]','N', sequence.strip())
    subsequences = sequence

or leave out the whole substitution bit altogether:

    sequence = sequence.strip()
    subsequences = sequence
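
The two alternatives can be tried outside the pipeline with a small standalone sketch. The function names `mask_ambiguous` and `keep_as_is` are hypothetical and exist only for this example; the regular expression is the one quoted above.

```python
import re


def mask_ambiguous(sequence):
    """Replace every character other than A, T, C, G (either case) with 'N'."""
    return re.sub(r"[^atcgATCG]", "N", sequence.strip())


def keep_as_is(sequence):
    """Strip surrounding whitespace and leave all bases untouched."""
    return sequence.strip()
```

Note that the mask also turns a lowercase `n` into a capital `N`. If the surrounding script later iterates over `subsequences` as a list of fragments, a bare string would be walked character by character, so wrapping the result in a list might be needed; that is an assumption about code not shown in this thread.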

Hope this helps.

Niels

(Sent in reply to Lucas Sinclair's comment of 2013-08-01, 2:40 AM, which appears in full above.)