Open Hocnonsense opened 9 months ago
Hi hwrn,
Thank you for your interest in COBRA.
Regarding your questions, please see my answers below:
final.contigs.fa
as the input file for flag --fasta/-f of COBRA.final.contigs.fa
as queries, as it will take too long to finish. It is OK to use contigs above a minimum length (e.g., 2500 bp) as queries, but as you can see from the reviewer comments, one of the reviewers thought there may be some issues that we could not predict. We know it would be good to extend everything using COBRA before binning, but you would be taking that risk yourself. I hope this helps. Let me know if you want to discuss more.
Best, LINXING
Thanks! I've been trying COBRA on my data these days. However, it seems to have been stuck at step [11/23] for nearly 20 hours. Is that OK?
2. PROCESSING STEPS
[01/23] [2024/02/09 20:22:34] Reading contigs and getting the contig end sequences. A total of 1980767 contigs were imported.
[02/23] [2024/02/09 20:24:12] Getting shared contig ends.
[03/23] [2024/02/09 20:24:30] Writing contig end joining pairs.
[04/23] [2024/02/09 20:24:30] Getting contig coverage information.
[05/23] [2024/02/09 20:24:33] Getting query contig list. A total of 339757 query contigs were imported.
[06/23] [2024/02/09 20:24:42] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/23] [2024/02/09 20:46:40] Parsing the linkage information.
[08/23] [2024/02/09 20:46:47] Detecting self_circular contigs.
[09/23] [2024/02/09 21:06:56] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[10/23] [2024/02/09 23:19:30] Saving potential joining paths.
[11/23] [2024/02/09 23:19:33] Checking for invalid joining: sharing queries.
Next I found that same_path has finished. However, contig_shared_by_paths hasn't been exported yet, and there is a for loop here:
https://github.com/linxingchen/cobra/blob/7eacae7c7aea049cd1d5ad5cbab88f961aeb11c5/cobra.py#L1194-L1213
I'm curious why there are so many pass statements; is there any condition that we should care about?
And can this code be improved like this?
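    # requires `import tqdm` at the top of cobra.py (used only for the progress bar)
    for contig in tqdm.tqdm(set(all)):
        # a contig listed more than once in `all` is used by more than one path
        if all.count(contig) > 1 and contig not in failed_join_list:
            for contig_1 in contig2assembly:
                if contig_1 not in redundant:
                    if contig in contig2assembly[contig_1]:
                        contig_shared_by_paths.add(contig_1)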
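The suggested loop still calls all.count(contig) once per distinct contig, and each call rescans the whole list. A minimal sketch of the same membership logic with collections.Counter is shown here. It assumes that `all` is a list of contig IDs, that contig2assembly maps each query contig to the collection of contigs in its path, and that failed_join_list and redundant are collections of contig IDs. The function name and the toy data are hypothetical and are not part of cobra.py.

```python
from collections import Counter


def find_contigs_shared_by_paths(all_contigs, contig2assembly, failed_join_list, redundant):
    """Return the path owners whose paths contain a contig used more than once."""
    failed = set(failed_join_list)
    redundant = set(redundant)
    counts = Counter(all_contigs)  # one pass instead of list.count() per contig
    shared = {c for c, n in counts.items() if n > 1 and c not in failed}

    contig_shared_by_paths = set()
    for contig_1, path in contig2assembly.items():
        if contig_1 in redundant:
            continue
        # isdisjoint() is False as soon as the path holds any shared contig
        if not shared.isdisjoint(path):
            contig_shared_by_paths.add(contig_1)
    return contig_shared_by_paths


# Toy data (hypothetical): contig "c" appears in the paths of both q1 and q2.
contig2assembly = {"q1": ["q1", "c"], "q2": ["q2", "c"], "q3": ["q3", "d"]}
all_contigs = [c for path in contig2assembly.values() for c in path]
result = find_contigs_shared_by_paths(all_contigs, contig2assembly, [], [])
print(sorted(result))
```

Because the result is a set built from membership tests only, it should not depend on iteration order, which makes it easy to compare against the output of the original loop on the same input.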
P.S. 1: to support biopython>=1.82, https://github.com/linxingchen/cobra/blob/7eacae7c7aea049cd1d5ad5cbab88f961aeb11c5/cobra.py#L13 can be changed to from Bio.SeqUtils import gc_fraction as GC
P.S. 2: Wishing you a happy Spring Festival in the Year of the Dragon, good health in the new year, success in your work, and all the best!
Hi,
Sorry to hear that the sharing-queries step took so long. One reason is that you have a lot of joins given the huge number of queries (339757). Did you use everything >= 1000 bp?
Your suggestion for those lines looks good; could you please try it on your end to see how fast it is?
Regarding the GC function, could I just add one more line, from Bio.SeqUtils import gc_fraction as GC, without checking which version the user has installed? I am not sure about this. If I just change that line, users on 1.81 would hit an error.
Thank you.
Happy New Year.
Best, LINXING
Thanks!
I kept the 1980767 contigs >= 1000 bp as the FASTA assembly input and used the 339757 contigs >= 2500 bp as queries. It's really a big project. The program has now finished, and the log is here:
2. PROCESSING STEPS
[01/23] [2024/02/09 20:22:34] Reading contigs and getting the contig end sequences. A total of 1980767 contigs were imported.
[02/23] [2024/02/09 20:24:12] Getting shared contig ends.
[03/23] [2024/02/09 20:24:30] Writing contig end joining pairs.
[04/23] [2024/02/09 20:24:30] Getting contig coverage information.
[05/23] [2024/02/09 20:24:33] Getting query contig list. A total of 339757 query contigs were imported.
[06/23] [2024/02/09 20:24:42] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/23] [2024/02/09 20:46:40] Parsing the linkage information.
[08/23] [2024/02/09 20:46:47] Detecting self_circular contigs.
[09/23] [2024/02/09 21:06:56] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[10/23] [2024/02/09 23:19:30] Saving potential joining paths.
[11/23] [2024/02/09 23:19:33] Checking for invalid joining: sharing queries.
[12/23] [2024/02/11 03:45:03] Getting initial joining status of each query contig.
[13/23] [2024/02/11 03:59:01] Getting final joining status of each query contig.
[14/23] [2024/02/11 03:59:14] Getting the joining order of contigs.
[15/23] [2024/02/11 04:00:07] Getting retrieved contigs.
[16/23] [2024/02/11 04:00:15] Saving joined seqeuences.
[17/23] [2024/02/11 04:00:21] Checking for invalid joining using BLASTn: close strains.
[18/23] [2024/02/11 04:02:48] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[19/23] [2024/02/11 04:02:51] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[20/23] [2024/02/11 04:02:51] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
[21/23] [2024/02/11 04:08:00] Saving joining status of all query contigs.
[22/23] [2024/02/11 04:08:31] Saving self_circular contigs.
[23/23] [2024/02/11 04:08:31] Saving the new fasta file.
3. RESULTS SUMMARY
# Total queries: 339757
# Category i - Self_circular: 181
# Category ii - Extended_circular: 0 (Unique: 0)
# Category ii - Extended_partial: 17524 (Unique: 11521)
# Category ii - Extended_failed (due to COBRA rules): 76038
# Category iii - Orphan end: 246014
# Check "COBRA_joining_status.txt" for joining status of each query.
# Check "COBRA_joining_summary.txt" for joining details of "Extended_circular" and "Extended_partial" queries.
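The time spent in each step can be read off the log by subtracting consecutive timestamps; in the log above, step [11/23] runs from 2024/02/09 23:19:33 to 2024/02/11 03:45:03, a little over 28 hours. A minimal sketch of that calculation follows. The regular expression assumes the "[step/total] [YYYY/MM/DD hh:mm:ss]" prefix seen in this log, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches the "[11/23] [2024/02/09 23:19:33]" prefix of each log line.
STEP_RE = re.compile(r"\[(\d+)/\d+\] \[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]")


def step_durations(log_text):
    """Map each step number to the time elapsed until the next step started."""
    stamps = [
        (int(m.group(1)), datetime.strptime(m.group(2), "%Y/%m/%d %H:%M:%S"))
        for m in STEP_RE.finditer(log_text)
    ]
    return {
        step: next_start - start
        for (step, start), (_, next_start) in zip(stamps, stamps[1:])
    }


# Three lines copied from the log above.
log_text = """\
[10/23] [2024/02/09 23:19:30] Saving potential joining paths.
[11/23] [2024/02/09 23:19:33] Checking for invalid joining: sharing queries.
[12/23] [2024/02/11 03:45:03] Getting initial joining status of each query contig.
"""
durations = step_durations(log_text)
print(durations[11])  # elapsed time of step 11 as a timedelta
```

The last step in a log has no following timestamp, so it gets no entry in the result.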
I've submitted the job again, and it may take three days to finish. It would be very helpful if COBRA could resume from the last interrupted run!
For biopython, gc_fraction was introduced in 1.80 to replace GC, so only users of biopython<=1.79 would hit an error.
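One way to avoid checking the installed version explicitly is to try the new import first and fall back to the old one. The sketch below is only an illustration of that pattern, not COBRA's actual code. Note that the two functions use different scales: the old GC returns a percentage (0 to 100), while gc_fraction returns a fraction (0 to 1), so a plain "gc_fraction as GC" alias changes the reported values by a factor of 100. The wrapper here keeps the percentage scale; whether COBRA expects a percentage or a fraction downstream is an assumption to check. The last fallback is a hypothetical plain-Python helper included only so the sketch runs without Biopython installed.

```python
try:
    # Biopython >= 1.80: gc_fraction returns a fraction in [0, 1].
    from Bio.SeqUtils import gc_fraction

    def GC(seq):
        """GC content as a percentage, matching the scale of the old GC()."""
        return 100.0 * gc_fraction(seq)

except ImportError:
    try:
        # Biopython <= 1.79: GC already returns a percentage in [0, 100].
        from Bio.SeqUtils import GC
    except ImportError:
        # Hypothetical fallback (assumption): Biopython is not installed at all.
        def GC(seq):
            seq = str(seq).upper()
            gc = sum(seq.count(base) for base in "GCS")
            return 100.0 * gc / len(seq) if seq else 0.0


print(GC("ATGC"))
```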
Regards, hwrn
Hold on. For the -f flag input you should use all contigs without length filtering. That's why you did not get any Extended_circular sequences.
MEGAHIT itself has a parameter, --min-contig-len, which controls the minimum length of output contigs (default 200). Other assemblers have similar parameters to filter out shorter contigs. On the other hand, you also indicated that the intermediate contigs, which include much shorter contigs, should not be used. In your opinion, which threshold is appropriate for the assembly before COBRA? Thanks!
Meanwhile, if I have already mapped reads to the unfiltered final.contigs.fa, can I use this coverage directly for binning, where only contigs >= 2500 bp will be used? The alternative is to map reads to the subset of contigs >= 2500 bp and generate another abundance file. Which do you prefer? Thanks!
Hi,
I do not suggest changing the default value of --min-contig-len of MEGAHIT or the similar flag of other assemblers.
Technically you should use the bam/sam file mapped to all unfiltered contigs to get the coverage file for binning.
Please keep in mind that for COBRA, -f/--fasta = all the contigs from an assembly, -q/--query = the contigs you want COBRA to extend.
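The split between the two inputs can be sketched as follows: the assembly FASTA is left untouched for -f/--fasta, and only the list of query names is length-filtered. This is a hedged illustration with hypothetical function names and toy records; it assumes that -q/--query can take a one-column list of contig names, which should be confirmed against COBRA's own help text.

```python
def contig_lengths(fasta_lines):
    """Yield (contig_id, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # keep only the ID, dropping the description after the first space
            name, length = line[1:].split()[0], 0
        elif line:
            length += len(line)
    if name is not None:
        yield name, length


def select_queries(fasta_lines, min_len=2500):
    """Return IDs of contigs at least min_len bp long; the FASTA itself stays unfiltered."""
    return [name for name, length in contig_lengths(fasta_lines) if length >= min_len]


# Toy assembly (hypothetical IDs and lengths); in practice pass an open file handle.
toy_fasta = [
    ">k141_1 flag=1 multi=3.0000 len=3000",
    "A" * 3000,
    ">k141_2 flag=1 multi=2.0000 len=300",
    "ACGT" * 75,
]
queries = select_queries(toy_fasta)
print(queries)
```

Writing the returned names one per line would give a query list, while the short contig k141_2 stays available in the assembly for joining.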
Thanks for your kind advice! It helps me a lot!
After the rerun, it saved a lot of time (the loop I marked above took only 18 minutes):
2. PROCESSING STEPS
[01/23] [2024/02/11 13:05:52] Reading contigs and getting the contig end sequences. A total of 1980767 contigs were imported.
[02/23] [2024/02/11 13:07:30] Getting shared contig ends.
[03/23] [2024/02/11 13:07:48] Writing contig end joining pairs.
[04/23] [2024/02/11 13:07:48] Getting contig coverage information.
[05/23] [2024/02/11 13:07:50] Getting query contig list. A total of 339757 query contigs were imported.
[06/23] [2024/02/11 13:07:59] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/23] [2024/02/11 13:30:10] Parsing the linkage information.
[08/23] [2024/02/11 13:30:17] Detecting self_circular contigs.
[09/23] [2024/02/11 13:50:36] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[10/23] [2024/02/11 16:02:44] Saving potential joining paths.
[11/23] [2024/02/11 16:02:47] Checking for invalid joining: sharing queries.
[12/23] [2024/02/11 20:52:32] Getting initial joining status of each query contig.
[13/23] [2024/02/11 21:06:13] Getting final joining status of each query contig.
[14/23] [2024/02/11 21:06:26] Getting the joining order of contigs.
[15/23] [2024/02/11 21:07:18] Getting retrieved contigs.
[16/23] [2024/02/11 21:07:30] Saving joined seqeuences.
[17/23] [2024/02/11 21:07:46] Checking for invalid joining using BLASTn: close strains.
[18/23] [2024/02/11 21:10:13] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[19/23] [2024/02/11 21:10:16] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[20/23] [2024/02/11 21:10:16] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
Hi, thanks for the update.
I am confused. (1) Did you use the several lines you wrote to replace those in the original script? (2) Which step took only 18 minutes? I do not see that. Please clarify. (3) The number of contigs in step [01/23] remains the same; did you still use those >= 1000 bp for the -f/--fasta input?
It would be great if you could let me know what you have done.
Thanks for your quick reply!
(1) Yes, I edited lines 1198 to 1209 in cobra.py to:
    for contig in tqdm.tqdm(set(all)):
        if all.count(contig) > 1 and contig not in failed_join_list:
            for contig_1 in contig2assembly:
                if contig_1 not in redundant:
                    if contig in contig2assembly[contig_1]:
                        contig_shared_by_paths.add(contig_1)
Of course, the module tqdm is imported first.
(2) tqdm reported the time spent in the loop in (1).
(3) Yes, to compare with the last results, I used the same input. Next time I will try the original MEGAHIT output. Will shorter contigs improve the precision of COBRA results (for example, by indicating that the merge of two long contigs is unreliable)?
Great.
For 1 and 2, could you please compare the results before and after you edited the lines and let me know if they are the same? If it works, I will update these lines in the next release.
For 3, yes, COBRA needs those very short contigs to connect long contigs and make them longer. If you check Figure 2f in the paper, you will find that the short ones are very important.
I've checked the results. Unluckily, the two results are not the same. I think this is caused by Python's hashing, which iterates over keys (contig IDs) in a different order in different runs.
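For background: Python randomizes string hashes per process unless PYTHONHASHSEED is set, so the iteration order of a set of strings (such as set(all) in the edited loop) can change between runs, whereas dicts keep insertion order. Whether this is actually what makes the two runs differ is only a hypothesis at this point. If order does matter somewhere, one simple way to make it reproducible is to sort before iterating, as in this minimal sketch (the function name is hypothetical):

```python
def ordered_unique(contigs):
    """Return the distinct contig IDs in a run-independent (sorted) order."""
    return sorted(set(contigs))


# Contig IDs taken from this thread; the duplicate mimics a contig used twice.
contigs = ["k141_9211005", "k141_10144865", "k141_22888658", "k141_10144865"]
order = ordered_unique(contigs)
print(order)
```

Setting PYTHONHASHSEED to a fixed value for both runs is another way to test whether ordering explains the difference.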
I've checked an example:
related contigs:
>M72_2|k141_22888658
GGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCT
CGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
>M72_2|k141_10144865
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTT
ACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAAT
In the first run, these two sequences are in COBRA_category_ii-c_extended_failed.fasta.summary.txt -- they are Extended_failed (category_ii-c).
In the second run, these two sequences are in COBRA_category_ii-b_extended_partial_unique_joining_details.txt:
Final_Seq_ID Joined_Len Status Joined_Seq_ID Direction Joined_Seq_Len Start End Joined_Seq_Cov Joined_Seq_GC Joined_reason
M72_2|k141_10144865_extended_partial 5623 Partial M72_2|k141_10144865 forward 2880 1 2880 36.744 0.452 query
M72_2|k141_10144865_extended_partial 5623 Partial M72_2|k141_22888658 forward 2884 2740 5623 23.535 0.467 the_better_one
and the joined sequence is:
>M72_2|k141_10144865_extended_partial
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTT
ACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCT
CTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCTCGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
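The coordinates in the joining details are internally consistent: the query covers positions 1 to 2880 and the joined contig covers 2740 to 5623, so the two overlap by 141 bp, and 2880 + 2884 - 141 = 5623. The 141 matches the k141 prefix of the contig names, which is presumably the k-mer size of the assembly. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def overlap_between(first, second):
    """Overlap in bp between two 1-based inclusive (start, end) spans."""
    return max(0, min(first[1], second[1]) - max(first[0], second[0]) + 1)


# Start/End columns from the joining details table above.
query_span = (1, 2880)       # M72_2|k141_10144865, length 2880
partner_span = (2740, 5623)  # M72_2|k141_22888658, length 2884

overlap = overlap_between(query_span, partner_span)
joined_len = 2880 + 2884 - overlap
print(overlap, joined_len)
```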
------ edited and added below ------
However, when I checked intermediate.files/COBRA_end_joining_pairs.txt, I found that both runs give the same result (the contig IDs are k141_22888658 and k141_10144865):
M72_2|k141_10144865_R M72_2|k141_22888658_L
M72_2|k141_10144865_R M72_2|k141_9211005_Rrc
M72_2|k141_10144865_Rrc M72_2|k141_22888658_Lrc
M72_2|k141_10144865_Rrc M72_2|k141_9211005_R
M72_2|k141_22888658_L M72_2|k141_10144865_R
M72_2|k141_22888658_Lrc M72_2|k141_10144865_Rrc
M72_2|k141_9211005_R M72_2|k141_10144865_Rrc
M72_2|k141_9211005_Rrc M72_2|k141_10144865_R
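In the pairs listed above, the right end of k141_10144865 has two possible partners (k141_22888658 and k141_9211005). A minimal sketch for finding such branching ends in a two-column pairs listing follows. The function name is hypothetical, and reading the suffixes as left/right ends (_L, _R) and their reverse complements (rc) is an assumption based on the names alone.

```python
from collections import defaultdict


def partners_by_end(pair_lines):
    """Group the second column of each pair line by the contig end in the first column."""
    partners = defaultdict(set)
    for line in pair_lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            partners[fields[0]].add(fields[1])
    return partners


# The eight pairs listed above.
pairs = """\
M72_2|k141_10144865_R M72_2|k141_22888658_L
M72_2|k141_10144865_R M72_2|k141_9211005_Rrc
M72_2|k141_10144865_Rrc M72_2|k141_22888658_Lrc
M72_2|k141_10144865_Rrc M72_2|k141_9211005_R
M72_2|k141_22888658_L M72_2|k141_10144865_R
M72_2|k141_22888658_Lrc M72_2|k141_10144865_Rrc
M72_2|k141_9211005_R M72_2|k141_10144865_Rrc
M72_2|k141_9211005_Rrc M72_2|k141_10144865_R
"""
partners = partners_by_end(pairs.splitlines())
# Ends with more than one candidate partner are branch points.
branching = {end: sorted(p) for end, p in partners.items() if len(p) > 1}
for end, candidates in sorted(branching.items()):
    print(end, "->", ", ".join(candidates))
```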
The related sequence, k141_9211005, is linked to a series of sequences:
M72_2|k141_10144865_R M72_2|k141_9211005_Rrc
M72_2|k141_10144865_Rrc M72_2|k141_9211005_R
M72_2|k141_11312800_R M72_2|k141_20768437_L
M72_2|k141_11312800_Rrc M72_2|k141_20768437_Lrc
M72_2|k141_13152222_L M72_2|k141_6797120_Lrc
M72_2|k141_13152222_Lrc M72_2|k141_6797120_L
M72_2|k141_13152222_R M72_2|k141_20768437_L
M72_2|k141_13152222_Rrc M72_2|k141_20768437_Lrc
M72_2|k141_18538400_L M72_2|k141_20768437_R
M72_2|k141_18538400_Lrc M72_2|k141_20768437_Rrc
M72_2|k141_20768437_L M72_2|k141_11312800_R
M72_2|k141_20768437_L M72_2|k141_13152222_R
M72_2|k141_20768437_Lrc M72_2|k141_11312800_Rrc
M72_2|k141_20768437_Lrc M72_2|k141_13152222_Rrc
M72_2|k141_20768437_R M72_2|k141_18538400_L
M72_2|k141_20768437_Rrc M72_2|k141_18538400_Lrc
M72_2|k141_6797120_L M72_2|k141_13152222_Lrc
M72_2|k141_6797120_Lrc M72_2|k141_13152222_L
M72_2|k141_6797120_R M72_2|k141_7723568_Rrc
M72_2|k141_6797120_Rrc M72_2|k141_7723568_R
M72_2|k141_7723568_L M72_2|k141_9211005_Lrc
M72_2|k141_7723568_Lrc M72_2|k141_9211005_L
M72_2|k141_7723568_R M72_2|k141_6797120_Rrc
M72_2|k141_7723568_Rrc M72_2|k141_6797120_R
M72_2|k141_9211005_L M72_2|k141_7723568_Lrc
M72_2|k141_9211005_Lrc M72_2|k141_7723568_L
M72_2|k141_9211005_R M72_2|k141_10144865_Rrc
M72_2|k141_9211005_Rrc M72_2|k141_10144865_R
In the first run, contig k141_9211005 is in COBRA_category_ii-c_extended_failed.fasta.summary.txt.
In the second run, contig k141_9211005 is not in COBRA_category_ii-b_extended_partial_unique_joining_details.txt.
Oops. Can you share the potential joins file with me? I can take a look to see which one is correct.
Best, LinXing
LinXing Chen, Ph.D. Associated Project Scientist, The Banfield Lab, University of California, Berkeley, USA 94706 Phone: (1)510-701-7864 Email: @.***
2024年2月12日 -0800 AM5:30 Hocnonsense @.***>,写道:
I've checked the results. Unluckly, the two results is not the same. I think this is caused by hashing of python, which will iterate keys (contig id) in different order in different runs. I've checked an example: related contigs:
M72_2|k141_22888658 GGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTAC
CGTGAAGGCACTAGCTACCTCGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG M72_2|k141_10144865 TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGG
CCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTTACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAAT in the first run, this two sequences are in COBRA_category_ii-c_extended_failed.fasta.summary.txt -- they are Extended_failed category_ii-c in the second run, these two sequences are in 
COBRA_category_ii-b_extended_partial_unique_joining_details.txt:

Final_Seq_ID Joined_Len Status Joined_Seq_ID Direction Joined_Seq_Len Start End Joined_Seq_Cov Joined_Seq_GC Joined_reason
M72_2|k141_10144865_extended_partial 5623 Partial M72_2|k141_10144865 forward 2880 1 2880 36.744 0.452 query
M72_2|k141_10144865_extended_partial 5623 Partial M72_2|k141_22888658 forward 2884 2740 5623 23.535 0.467 the_better_one

and the joined sequence is:

M72_2|k141_10144865_extended_partial
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACG
CCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTTACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCG
AGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCTCGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGA
CGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
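As a quick sanity check on the coordinates reported in the joining details (the query contig covers 1-2880, the joined contig covers 2740-5623), the arithmetic can be sketched as below. This is only an illustration and not COBRA's own code: the function names `join_with_end_overlap` and `joined_length` are hypothetical, and the sketch assumes both contigs are in forward orientation and that their shared end is an exact match.

```python
def join_with_end_overlap(left: str, right: str, overlap: int) -> str:
    """Join two sequences whose ends share `overlap` identical bases.

    Hypothetical helper for illustration; assumes forward orientation
    and an exact match between the end of `left` and the start of `right`.
    """
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter sequence length")
    if left[-overlap:] != right[:overlap]:
        raise ValueError("the end of `left` does not match the start of `right`")
    # Keep the shared bases only once.
    return left + right[overlap:]


def joined_length(left_len: int, right_len: int, overlap: int) -> int:
    """Length of the joined sequence when the overlap is counted once."""
    return left_len + right_len - overlap


# Values copied from the two joining-details rows.
left_len, right_len = 2880, 2884
right_start = 2740  # 1-based start of the second contig in the joined sequence

# The overlap is the stretch where the two coordinate ranges intersect.
overlap = left_len - right_start + 1
print(overlap, joined_length(left_len, right_len, overlap))
```

With these numbers the overlap comes out as 141 bp and the joined length as 5623, which agrees with the `Joined_Len` column. The 141 bp overlap is also consistent with the `k141` prefix of the contig names, assuming that prefix reflects the largest k-mer size used in the assembly.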
Let me know when you have time to share the file. :)
Sure! Sorry for the late reply.
I've selected all the related sequences (there are only 10), and here they are:
input fasta:
>k141_13152222
TTTTTTATTTTTCTCGTCATGCGACTAGCGCCCGATGGCACCCGTTGGTGCTGAAAAGTTGCTGAGATTGCTCGGGAAACTTCGTAAAAGAAAGTTTTTTTTGATAAACAAGGGTATCATCGTCTTCGGACTGAAATGCAGCCCACCCTTTTCCTTGCCGCGCACTACCCATTTTTTTGTAAAACTTTCCCGGAAAAAACTTCCACTTTTTCCCTTTTTCAGTGCTCTGCAGTTCTTTCTCAATTTCTTCTTCAGTGAGGTGCTCCGTGGTTTTTCCTGTTATTTTGTCAAGTTGTTTGCGATATGCGTCTTCTTCAAGGAGAACAAAATATGCAAAAACTTCAGCTTCGTTCGTCAGATAGGCCGAAGTGCACACACATAAGCAGCAGGAATTTCCTAAGCGAGTGCACAGCGAACTAACGACTGAAATCAGGGCGTGGACGGAAGTAGGAGGAGCATGAAAAAAAAAATATGCTTTTGTCGATGAAACAAATATTAAAGATGGAGCTTCAGACCCAGCGCGAGCCCATTCATCCGCATGCAACAACCGAGCAACTAAATTCTCCCACTTTTCCACAGAACCTGGTGGCTCTACGGAAGCTTCGACGATAATCGAAATGTTCTTCTTCAAAATGGGGTTAGCCATTTTTTTGGCTTCCTCGATTTGTTCCTTGAGCCTACTGTAATAAGAAAGATTCATTCCTTGGGCAGGAAACTTTTCCTTTTAACTCTTTTAACTCTTTTATCGAGAGCGGACAAATTCAAGGAGAGTTAAACAGAAAAAGGATGGCTTCTAGGAATGAACCGACGCGACTGTATCTTCTACGGCGAAGTTGTCAAAGGCTACATAATCAAAACACTTATTGACGTCTTAGTGGGAAGTTTTAACCGGACTTGCTTTACGGTCACCAAGGATGGAGTTTTTTTACGAGAATGCGATAAAAATAGGAGCATTTTGTTTAATATAGAACTCTACAGAGAAAAGTTCAAAAAGTACAAGTGCGATGCGGACATCCATTTCAGTGCAAACGTAAAGCATATCCAGAGACTGATTCGAAATTTAAAAAAAAAAGATTCTCTGATTCTCGGTATTCGCCGCTCTTCGCCCGAGATGCTCTGTATAATGATCTGTCCGGCGCGAAAACCGGACAATACTAACTTCCGAATGGAAACGGCGGACATTCGAATCCAACTAGAGTCCCAACCAAATTCCGTCGTGATCCCCGATCCTTCGGTTTATTCTTACCCTTATGTCATTGACGCAGGAGAATTCCAGAAAGTTAAACGCATAGCCAGCGTAGCCAAAACGATCAGAGTTATCATTCGCGGCGACAATTATTTAGGCTTTTTGTGTGACAAAGAGATTTATTCTACTGCGTTACACTTTGGAGATCCCCAAGCGACCGAAATCCCGAAGATCTCTCGGTCACCCGAGGATCCGGCCGACCCATACGATTCGGTAGAGTCTGGGGAAGAAGAAGAAGATCTAAAATCTGGTACGGGGAAAGAAATGCGAGAGTACACTGCTGATTTCCATTCTTCTTTGTTCAACCATTTAGTCAAACTACCTGGGCTCTGTACCCAAATGCAGTTCTATGCCCCTATTGAAGAACTTTGGCCTTTACGTATTCGAATGGAAGCGGGATTACTGGGCAATATCGAAGTGTACATCAAAGACGTCCGAACTCTGGAGTACGACGAAATTGAGCAGTAACTATGGTTGCGACCAATTAATTCCTTTTTGTCTCAATTTTTGCTTTATGTAAGGCTCCATTGCTTCGGGAGGAATGAAATGTGGTATTCGAATCAAAACTATTCCTTGAGCTTGACATTTTAGGTCTTTCCACGTGTCTCGTTTTTGTTGGTAAACGAACTCCTTCGGTCCCTTTTTGTGAAAGTAAGCCCTGAATTTTGAGTGCTGAATCCCGTCGTACTCAAACGCAAGTCCTCTTCCGAACGGAGTTTTTAAATCTGCACAAAACCCGTCAAGTTCGAGTCTT
TTTCCGGTGACTGGGTTTACTAAGAACTGCGGACGCTCGTTAGGAAAAGATTGCCGGAAGAGACGCTCGAATATAGCTCGACACTTTTCTTCGCTTTTGTTCGTACGGGGTTGCTTTGGTACTCCTTTTACCGAGGGTGACCCATCTTTTTTCGCAGCATTCGTGGCTTTGTCTTTTTGAACTGCTTCTCGTACCCCGAGACCCGAAATACCTTTGAAAGGGTACTTTTGTTTGGGAGCTTTCCAAGCGTAGCCTAGTGCTATTGCGACTATAACTACGACGAACACCGAGTGCACGCTAACCCATCGACTTTGGAACAAGTGGCGACAAGTACTCCACATTTATCACTCTAGGAATTTTCCTTATCCCTAATCTCTTTTTTGAGTAGCTTCGAATTTTTAAAGAAAATTGTAGATGAAAGTTCTATTATACTTTCATTTACTAAAAACATAATTCCTAAATGGAATAGAGATTAGAGTCTCTAGAATTAAGCTCAGTAATGTCGTGCTCGGTGGACGCTCGGGAACATACCCCTTTTGTTATATCTCGATTGCAACGTACTTCCCCAACGTTATTTCTAGTGACCCTCGACGGCACACCGCAAGGCTTTGCGGAAACCGAAGAAACTGCCCGTTTCTACGTTAGAGCGCTCGCAGATGATTTAAAATTTCGACTCGGAGTGCTTCACCCGGCGAACGAGTTTCAGATCGATCTCTCCCTACCCACGGTCAGCGTTTTAGAAAATCAAAAGGGTTTAATTTTCGATTCGGGGTTTCAGGTTATTCACCGAATTGCTTTCTTCTCCATTTCTCTGTTGGACGTGAATACCTTGAAGGTTTAGGGTCGCACAAATTTCCAAAAAGTTGTTAATTATGAAAATGGTTGACTGCACACCCCAAATTTTGTTAGGCCGGTTTTCTGTTGATTAAAGAAATCGTCAGTTCCTCGCTAAGGCGTTGGGGTGTCAAAGCACTCAAAAATAAAGACGAATTTTTTCGAAGAGCGAACTCCCGTTTTTAATTTTGTACAAATCAGGTTAAAAGAACGGTCCGCACTCATAGGAAGAATATTTTTGTGCATGAAACTTGAGGTGTGGGTCACGATTGACGTTCGGGGTAAAAAAGTCGGGTTCAGCTCTCTCGGGCGGTTCCAAGACTCGAATGGGATTCGCAAAAGCGTTGCGCCACAGGAAGACGGGTATTGCACAGTTATGGTGGATTCTCAAACCCACTCCTTCCACGACTTGGTGTGCACCGCTTTTCACGGGGTGAAATCCTCCCCTGACCTTGAGGCGCACCACATCGACCACGACACGACCAACAATCGACCCGACAATTTATGTTGGGTGACGCACCAAAAAAATATGCAAGAAAGTTACCGCACCCAAACGCGAAAATCGAGTGGTCCGCAACGAAGCCGACAAATCCTGGGACGGAAGCACAAGTCCACGAAGGAATGGGTCCCGTACGCGAGCATGAAGGCGGCAGCAGAGGAGCTCGGGTTAACCGTCGGACCCATCAGCGCCGTCGCAAGGGGAAAACAGCGTCAAACCGGCGGCTACGAGTTCAAATTGGCACCGTCGCCGGATTTGCCGGGGGAGATTTGGAAATTGCTCACGGTGAACACCAAGAAAGTTCAAGTGAGTTCCTTGGGGAGATTCACGGACTCCCGAGGATTGAAAAAGTCGCCCGTGCCTAGCCGCTCCGGGTATTGCCGCGTCAAGATCAACCGGAAAACGTACTACGTCCACCGACTGGTGTGTGAGGCGTTCTGGGGCCCCTCCCTTGGGTTGGAGGTCAATCACAAAGACCTAAACAAATCCAATAATCATTATATGAACTTGGAATGGGTGACGAGACGTGACAACACCTTACACAGTTACAGTACCAACAAAAACCGCCGTTCGAGTGCCGCCAAACAGAGCAAGCCGGTGTACGGCCGCAAGCACCAAACCAACGACGAGTGGGTGGAGTACCCGAGCATGAGCAACGCAGCGGGAA
AGCTGAATCTACACTCAGGCGCAATCTCAGCAGTCACCAAAGGAAAACAGCACCAAACAGGAGGCTACGTATTCAAGCTGAAACCGCCGGAGGAAATATCGACGCTAAGTCTACCATGTTCTTTTAATATAAAAAACTCCACCCACTCGTCGTTGGTTTCGTGCTCCTCCAATAAACGTTAAAACAAAATGATTGCACCTAACAGACACGC
>k141_6797120
CTGCATTTCAGTCCGAAGACGATGATACCCTTGTTTATCAAAAAAAACTTTCTTTTACGAAGTTTCCCGAGCAATCTCAGCAACTTTTCAGCACCAACGGGTGCCATCGGGCGCTAGTCGCATGACGAGAAAAATAAAAAAATTTCAAAATCAAAATAATACAATACAACGGGAAAGAGTTAAACAATTTTGAATTTTCAAAATAAAAAAAAAAGACCCGAGCTTACACTCTCCGTTGAATAGGTTGTTCGGCAGTGGCCGGCATGCATTTAATTCCCTTGTTTATCAAAAAACTCGCATCCGGGGTATCTATTGCTTCAATAGGTTATGATTTTTTGTTTGAAATAAAAAGCAAAAAGTCAAAAACAAATAACAAGAATTAAACACGAAGAGCGTAAACAAAAACCTGAATTTGTCTTTATCCTTTCATCGCGCTGCGCGCGCCGCCGAAGGCGGTGCAAAGCACGTTACCCTGCGGGCCTCCGCTACGCGGAGTGAACGCTGGAAAGGAAAGAAAAGTCAAATCCCGTGCTAAAGTACGGGAAGGCACGGATGAGCGGTCAAAATATTGAAGATTACATTAATTGTCCGGTCACCCAATACGGCGAGGTTTTTTATTTTATTTTCCTCTAAAAATGGAAGACGAAAAGCGCGATCCCAAACGAAGTATGGTAGACGAAAAGAAAGAAGAATTCTTGTACACGATTAAGGGGATCGCTCTCCCGAACACCCCCGTGTTTTACTCAACCCTTACCGACCCACGCACCGACCCACACACTCGTTACACTATTTCGGACCAAAATGGTATGTGGGAAATAGTCGGAAAGGTGCAGGACATCCCCACCCCTCCTCACTGGTTTTTTTGGACATCTGAGAATGAGAAGCAGCGAAAAGTGGATAACGAAAATTTCTATGCTTTATCCTTCAACGGCGACGAGATCGATGATCCCCTCAAGTTCCGTTCGAGATTTGAGAACCGTGAGCTCTCCATTACTAAATTCTCCATGGACAGTCCTATTTTTGGAAGCTCCTCAAAGCCTGGAATTCAAACCGAAGAAGACGATTTTCAGGGATTTGGGTTGCGACGCCACCAGTTTAAGTTCCCCGGCGAACTGGGGATGACGGTCGCGAGCTACCTTGAGCCCATCTTATGCGGGTGGTGTGCTAGGAAGAAGGTCAAGGACAAGTTAAGTAAGTTGTGGGAACAGGCTTGGAATGAGTTCGGCACACAGTGGCGGATCAGCGGATCGATGCCTCCTTGGTCTAAGGGTGAATGGACCCTTCAGGGAGGCGAGCAGAAAGGGATGTACGAGTACCACGGTTACGGAGATGACCAGGCCAACGAGTATTGGTTGAGTCCGATAGTCGATCGCCCGTGGATAGAGAGCAACGTTCCTCACCCTCTAGGCTTTACTGTGAATGTCCCAGCAATCTTGCAATATCTCAAGGAATGGTATACTGACCAGTTGCACGATGACAGTTACGTGAAAGCGCACATTTGGCATCAGGACGGCCACTTGCGAATAAGCGCTGCTTCTCTTTGTCCGCCCTGCCCGTGCTCATTGCACTGGTCACCTCCCTACCAGAGTTGCTACAACTGGCTGGTGAGTCATGAGGAGTTCCTGAACTCGTCGTCCGGGAAGAGATGGCTTTTATCGAAAGGTGCGCAGGAATTTTTACGCTCTAGTTGGTTCTATGTATGGGCCTACCTTACTGACTGGTACGCAAAAATGCTAACCTCCACAAACCCTTCGCTTGAGAAATTAGCTCAGAAGCTGCTCAAAGAAAGACCAGTCTTCTTTCTCCAGTCCCGGGACTTTCGGCTTCTAACTGAATCTATAGAGGGACAACCAGCCAGGTCCTCCACCTCCATGGGCACGGGCACTTCGACAGTTGCGCAGTGGCGAGCCACCAATAAAGAGCTGGAGATCCAGCTGGGCAGAACTCTTAAGGATTGGACCGAAGCCGACGTGCTAGCTTGGCTCGTCCTCTTACCTCTC
ACGTACTCGCACCCTTCCCTTAATCCCTATACCCTGAAAATCGCAAAAGCCCTGGATAAAGACGTTATACCCCTGTTCAAGAACTTAAAAGTCACCGGAGAGTTTTTGAGCGACGATACCAAGGTTCAAGGTCTAGAAGAGGAAATCACTCGACGATGGCATCCTGATGAGGTGGTGCGGTCCGGCATTCTACACGGTTATGCTTTATGGGAGGAGAACTGGTTCTTGGAACGCCTTAGGGTGTTGCGAAACTCGGACACGGGATAAGCAGTGTGTGTTGAAACGAGCCGAGCGGCAGCAGATCCATCGAGAAATACCACAACAACCAACAGGCAAGTTTACTGAAAAGCCATACTTAAAAAAAAAAGGAACATAGACGAGAAGATAATCTAGCAAAAAATAATGCTTATTGAGAAAGATGCCTCTCTGCCGCAGCGACGTTACTCGCCGCTACAAGGGAAATGAGCCTTCGCCTTCAAAAAAAGGAAAAAACTATCGCAACTTGTAGCCAAAGTTTTTTCCTTTTTTTGTTGCCCGCCCGTTTCATGGACGCTCCATTGCATCCCCGAACGACAAGTCGCGGAGGGTCTTCCTTGAACCTTGCGCTCAGACATTTTGATTCGGAATTGACGACTGATCAGCACAGAAAACTTATAGTGGGTTACTCAAAACAACATGAAGTTCACAGTAACGATGGTCGATAGAGAAATAAAGCAAACAAAAT
>k141_22692969
GATTCAACCCGGACAAGTATATTTCCAAGTCAGGAAAAAAGAAAAATCCGTGTATGGCCACCCGTCTGGAAAAATTAAAATTGGAAATCGAAAAACAAATAGCACGCGTTGAAAAGGGTGAAAATGTTGAATTAGTAGAGAGGATATACATGTATTACGGAAGATACACTTAATAGGCTTTCAGAGGCTGACGAAGGCTACGGTGGTAAGTAGAACCTGAATTACGACTGGAGACGAGCCAAGTTCACTGTTACTCCCCCGGCCGTAGTGTCGGGGTACGACGGGATTAAAATCGTTTCCATTCCGTTTCCAATTCAAAAGAGATGGCAAATCGCCTGCTTCCGAAGAGATCGAGAAGCGTGTTTGTTTAACAAACACTGTTTAACTCCTCCTTTTTTTTTGAAAACATCCAGTTTAATTCGTCTGTTTGATCTGATGTGTTTTATTTTTCTCTCCCGAAAAGACAATGAGTACCCCCGCTGCCGTTACCGGAATGTTTGTAGACTTCTCTTCTGCGGCCTGGTGTGACTGCCTTGAAGAGCCGATTTCAGGTATGCATAGTCGATACGTGTGCATTACCGGGGTGAACTATATGACTAACACAATATTCAAAATTAAGACCATAGATGTTGACTCGCAAGTTTATGACCTTTATCAGGCCGCATATCTGGACCTACGCAAAGCCGACGGAAGCTCTGAACTTCTAGACATGGGTTTACCGCTTCCCAATTCTCCGGTTCCCCTTCGAGGAAACTTTTCGACCTTATTCGGGCTCCCGGGCAACTCTTACGGGCAGGTCGTTCCTGCAGTGCAATTCAACGTTGGAGATCTTCTTGTAGGGTACGTCCCTGTAGAATTTCTGGAACGGAGTTTGGGCCCAGTTTTGCGTTCCCCATGCACGCAGTTCCTACGGCGACCGTGAACAAACTACTATTTTCAAAAATTTGTGTTTGAGATTTTACATATTCTAGACGTGTAACAGCGTCTCCTGAGGAGAAAACACCATGCAGAAAAAAAACAGAGAAAAAGAGCGGTGCCAACAGAACACAACTCCGGGGAAGGACTTGACGACAGGAAACTCCGAAAAAAGATGTACCCATTCTGATCTTTCGGGTTTCTAACACACTCTGCACGAGCAAAGCACGTTACCCTGCGGGCCTCCGCTACGCGGAGTGAACGCTGGAAAGGGAAGAAAAGTCAAATCCCGTAAATCCCGTAAAGTACGGGAAGGCACCGATGTATTTTTATTTTTCCGTTTTGTTGTGTATTTCGAATTAGACACGTATACGGCTTCGCGGTACGTCGGGTTCTTCATCGTCATTAACATCGTACCGTACCCTATACGTGCCGCTGGCGTCGACGGCTACTATTTGTGCGGGCCACCATTCCGCGACATATATTTTCTCCTCGGACCCTGGTGGCGCGTACCGTCCGCTCCACCACGCTTCTACTTTATCACCGACTTTAAACCCCGTAGTCAACCCTTTCTGGATCCGTGACCCTTTCTGGATCGTTTTTCGTGGCCCTTTCGTTGGTGCTCGCCCTTTCTGTCGTTGCACGGACTGTGGGGACTGTTGGGCTCCTGTGTAATACGTGACTTTTCCCGGGAAGTCCGGGTCTTCGGAGATGATTAATGCCCCTTGCAGGGGCGTGACTCTCACGGCCTCGTCGACGACTAAATACTTTACACGGTCCGACTCACGTGCCTTATAGTACAGTGTATTTAATATTGAGGCACGAACGGCAGCGTCAACCGAACAAAGAAGACGAATCCTGAGTACTTCCTCTTTCCTATCCGTATCATAAAGAACAAGACCACAGATTTCTCTTCTTCGAATTGGATTAGAGTCAGGGTTAGGAAATAATCCCGTTGGAGCCCAGAACTTCTCTTCCACCCCCAGCCCCTGGAAGGGGCCTGACCCTGTGAACCGCCCCTTTATTTCTGCGCTGTCGGTATGCATATCGGGATCAATTGCGCTCTGGGTACGCATTTCGGGAAC
AGGCTCAGGAAGCGAGGGCTTGGTAGCGATCGCGAAAAACTGGCCGTTTTCCTTCGTTGGCTGCAGCAGCGTCTTCACTGTCTCACAGTCTAATTGTTGCCGGCACTGTCCGCAGACGACTTGATTAAGTTCTTCGTTGATGCGCCACTCTTTCAAGGTGGCGGGCAACTCCCGGAGAATACGGGGATCTTCGAGTATACATGACATTTTATCTTATCTTATCTTCAGTGCTTATTTTTTTCCTGTAAACTGGGAGTTGCTGGGCATTCCCATACTTTTGCAGCACTAGCTGGCCGATACAAAGTGGAACGCAAAAATTACGTAAAGACAAAAAGACGAATTCTGACATTTTGTTTCGGAATAAAGTATTTTTATTTTTATTCGTAAACCTTTTTGAATAATAAATTTCGCATTTAACTTTTTTGATTTTTTTGTAACAAGTACGACCGAAGGTACCACACCGCCGACGAGTGGGAGGAGCATACAAATGAACTTCAGGAGTCAAGTTACCCGTCGGAAGGTAGTGCTTGTGCTTACGCTTACCCACGTATTCAGCGATCAGCTCAGCCCTGTCAGGATCGGCATCGGCAAACCTCTGGACCCTATATAAAACACTTTTAAATGAACTTTAAACATGAAATTAAAAGTTCGAAGTTTCGTATTGTTTGGACACCTAAATATAGAAAAAATTTTCTTCGTGCTCTAAGTAAAAAAGGCCATCTATGGTCCGAACACTTTCGCAGAGCGGCTGCCTCCTTGAAAGGCCTCTTAAAAAAGGGTGGCTAGGGGTGCATGGACGCATCGACGAGCAGTGTGCCAGAAACTTCTGGTTAAGTCATGTGCGCAGATCATACCGGTCGTACAGAAATTTGGACCCGGAAAAAGCGTTGGAAAAAGCCCTAAAGGACAACGAATCGACGCTCTCCGCCCGGTGCGAGGGAACCTTGGACCTGGAGAAACTGTCCTGCAGCGGATATTGGTGGGACTATAAATCGAGTCCCTCATCCCGAGCAGTTTCTACGTCGCCGCAGTGGTCAGATTCCGAATGGAAAGATATCCAAAAATTCGCTCCGGGGGACCTTCTTGAACGTCTCGGAGGGGCAACAGAAACCGCGCGTTTGAAAGCAGAGGAAGAACAAAGAGCGCGACAAGCCGAAGCCGCGCGTTTGAAAGCAGAGTTCAAATTATCGGGAGGAGAAGAATACCGAAAATATGATGCTTGGCAATTCCTGGAGGCCACGGAAAAAGAAGAACGGAAACGCAAAGAAGAAGCAGACCCAAAACGCAAAGAAGAACAGAAACGTGAGGAAGAAGAACGGAAACGTGAGGAAGAAGAACGGGAGACCCGCTCCTGGCGAAAACGTTTGGCTCTGGAATACCTCTCGCCCGAAGTTCTGAAGAACGTGCTAGAGCACCTCCCGAAAGACGACCCATTGCACCCAAAAGTGTCCGCCCGACTTAAGTCGGCAGAAGAAGACGTGTCTGCCCGACTTAAGTCGGCAGAAAGGACTCAGTCGTTCCAAGAGCGATTGCGAACTTGTGAAGCCCTTTACTCGGAGTACGCACAAACAGCCGCTTCTCCGACGTCGTCCGCTCAACAAAGAAAAAAACTCGACGAACAACTGAGTACCTTCGAGTGCGCAGAGATAGCAGAAAGAGGGTGGGAGGAGAAGTGCTGGGAGTGCCAACGGGCACCTGCACCTGGAGAGTGCAGCCAGTTTCCTCCGAACTGTCAGGTGGTGACGACTCGACCGCTACGGTGGGCAGACGCCGTTTTTTTCCAAAAGGGAATAGAATGCGACTCGAAAGCGAGACTCACTAATACTCAACAGGAATTTTACGTAAAGCATCTTCGCCGGATCGTCAAAGGTACGGACGGAAACGTGGAGGACGCCCTTAAGCACTTCCGTCCCGATGTTTGCCTCTTGGGTAAGGGACGCAATGCGGACTACATCTTCATTTTAGCTTTATTTTTCACTTACGCTATTCTCCCTTCGCTG
AAGGCCCGCCAGCTGTGCCTAATTCCCACTATTGAATTCGAAGTGTTCTCGGACAGCGAAATCGAAAAAGTTTCACCTCAACTGTTAGGTGATGAGCGCAAAGCTTTCGTGGCTTCCTGTGTGACGAGAGCCCAAGAAAAAATCAATCGATTTTTCTATCCCGTACTTCCTAAGTGGCACTGGGCGCGTTCGACGACCAAGTGCCTCAGCTTCGAGCTTCAAACACTTATCGGTTATCGTAAGGATCTCGGCTTCTACTTCACTGAACCAAAAAGGGACGTCATCGTGGTACCTCGGCTGTTCCTCCATATGCGGAGGCAAAACGGCAAATGGACGTCGGAGCCTCACGATTCTAGCTGGTTCGCGCGATGGGAAGCCAAGCGAAAGCTCGAACAGGAAATTGTGGACTCTCTGAAACCGCTAGCGCAAAAGAGTTTTAAGTCTCTCTTTGAAGAAACTAAATCTCTATTTCCTTCTTTTGCCCGTTACAACCTCGTAGTGATGCTCCCCAGACATGCGACCTCGGCCATTTTTCTGCCGGACAAGAGCGAGATTTGGTACGTCGATACCGGCACCCGGGCGTTTTGCGGACACGCCTTAGGCCCGGCCACAATCGTCCGGGACGTCCTGCAGGTTTCTCCTGAATTCCGGTTTTTCACTCCAGGGTGCAGCCACCTCGCCTACAGGCAACGAGGCCCTTCATGCGCTCTCCATTCTTTGTTCTTTTTCTGGTACGGGTGCTGTAACGACTTGGCCGCGGTGCAGAGATTGTGGAATTTTCAGTCGCTGTCAGACTGGAGAAAACACAAGGAGTTAGTGAACGCGCATCGCCGGGCTCCGTTTTTGTGCCCA
>k141_9211005
GAAACGCACCCCTCCGTGGCAGAACTCAGTCCCCTGAACATGAGAGGATTGAGGCTCGTAGTGGTGGGACATGCGGTAATGGTGATCTTCACTCCGGAGAGCTTCTCCGAACAGGTTTCGAGCTTCTTTTTAAAGTTCCGAGCTGCCTCCGTGATCTCTTTTTCCTGGCGATCCATTTTCAAAAAAAAAATAAAATTGAGTTCAAAATTTGGAGAATAAAAAATTAAAAATAAAATTAGTGTTTGATAAAAATGGGCTATACCGCTATAGACATATAGCCGATAAGCACGTCGCCAGTCCGTTGACCGTTGGTCGGTCACAAAACTGACGAAAAAAACGCCACCGTTGACTGTCCGCCGTTGACCGTCCATCGGCCAAAAAAAAATATAAAAATATAAAAATACCTTCTTTAATCCCTCTTTTCGATAAAACAAAAAACCTTTTTTTTATTCTCCAAGGTTTGGCATTCTAGTTTTTGTCTTTTGACCAATGGCCGGAAGAGGAAAGACGACACCGATAGCTCAAAAAACGATGGCTCAAAAAACGATGGCTCAAAAAAGTCAAAAAGCGGAACTCCAACCAGTTCCGGAGACCGATGAAAAAACGAGGGGCACGGACGATGCGAACAAGCACGTGGATGAACTTTTGAGGGTGCTTAGTGCTTTGGAGTGCGGTCATTCCATCCTCATCTTGCAGGATACGGATTCCAAGTTGATCCGAAAGTGGGGTGGCAAGGATCTCTGTGCGATTATGGAAAGTCGCTGTGTGAATAAATTCGCATGGACGCCCACGGCCGATAAAGTGGAGAATACCTGCGTGACCGGGACTTGGGCAACCTTTCTCAAAAACGGCAGCCATCCAATCTGCCGTTTCAGTGTGACAGATGAAAACACGGTGGAAAAACAAGTGGAACACATTCGAAACGTACACGAGCCCATTTACAAGCTGCTGAAGGCGATCCCCTTGGTGGCATGGATAAAAACGCGAAGCAACGATGCCGAAAAAAAGTACGGCCTTTCGTGGGATGTAGGGATTGGAAAGGGATTGGCCATCAATACGAGTAAAGACAAAAAAAAGAAAGGAAGATACTGGAAAAAGATTACCGACGGAAAGTGCGTTGAACTGCCGCTTGAAGGCTTCAAAACTGCGATCTTGCACCTGTGGAAAACAACCGCAAAACCCGCGTTCAAGAAAGGGCCGAATCCCCCCAATGCGGAAGACCTTGGTAACTTCCTCCAAACCACCACTCACGTGACCGTAGTGCTTAAAAATTGGTCTCGGGACAGCCCCACAATGGAAGAACTTCTCGCTATCATGGACCGCGAGGGGGAAAAGCTCCAGGACAAGCAATGGGTAACTCAAAAATCTGCTATTAATTTAAAAAAAATGTAATTCTAACAGTTATAGATAGTTTCATTGAATTGTATAATTCCTTTGTCGTTGCTTTTACTTTGTTTGTTGACATTCATACCGATTGCGTGTTAATGTGGTTTTTTATAGAATAAAAAACCTAAAAAACGGCAGAAGAGGACGAGGCCGGTACAAAAAAATCCGACAGCGACATCGGAAACAGAGACAAAAAAACAACGGCTCAAAAAAAAATAGAGGAAAAAAATTTGGGGTGTATTTTAACAATTATTCAATAAAAACTCCGAAATGGAACCGTTAATTACTGGAATTTGTAAACAATGAATCTTGGTACTCCCTAATTTTGGATATGGTCAGAGAAGAGAAAAAAAACAATAATTGTGAAGAAAATATTAAAATATTAACGGTTTTTTATTAGAAACATAATTTTAAATCCTCAATCCTTTTGTGATGTCTCAACGCGGACTGTTTGTCCTCCGATTTCGCGGGCAACGAGTCTTTGTCCTCGGGCAACCTCCTCCGCCACTTCCGTCCAAGGCATAGGGTGGTTTGTACTCGGTGTGTACTCTATTTTTCTTAACTTCGTCTTCTTCTTCTTCTTCTTCGGTTTGGTGTCCTCCCGATCGGGGTCC
CGATTCCGCTTCTTCGTCTTCGTCTTCGTCTTCGTCTTCTTCTTCTTCTTCTTCTTCGTTCCGTTGTCTTCGACCATGACCTCGTCAATTTTGTCTTCCTCGGGAGAAAAGTGTAAGTACTCCAATTGCTTCTGCGATATTTTTCCCCCGTAATGGTTCCTAAGAAGCGCACGCATGTGCCTTTTCGTTCCATACACGGATAATCCGAGGGGTAAGCTCAATGTAGAGTTCGAACTACTGGCCCGGGTCGGACCGGATTCGATGTTCTTTGCCGCGTCCATCTCGGCGATGTCCAGAACGTGGATGCCTTGGGTCTTGGTGATGTATGAACGGCCATCTTCTGCTACTGTGTAGCTCATTTTGCCCATTCTGCCCTTGTTGGCGTCCCATGCCACGGTCTTTTTGTTTTTTGTGTAGCGTTCCTCAACTGTTTGCATCGCCTCTTCCTGCACAGCAGCCCTTTCTTCGTCGGTTGTGCAGCCTGCAGCCGGAAGCATTCGTTGGACCTGTGCGTTCACGACTTTTTTAGTGAACTCTGAAGCCCGAGCTCCGAAGTCTTTAAACTGCATCACTGTGGCATCTTCTTTGGTTATTTCTTTGTGCTCAGTGCTCATGTGCTCCTCTGCTTTTTGCTTCGTCGAGAAGGTCTTTTTTTTATAAGTTGCTCGGTTTTTGCATATTTTCAAAGTTGTTTTCAAAGTTGTTTTCAAAAGTTTACCTTATTTTCGCAGCCGCAAGAGACTCCCGCCCAAACTTCAGTTGCATACTTCATTTTGATCATGATGGCCGTCTTAGGTTCCGAAGAACCAATGTACGAATCGAGTAAAGTAGTGGTCTTCTTCTCCTTGCCTGTGTGTCCGCACGCGCACTTCACCGTCCAGGTGTTGCCTGTTTTCACGCACGCCGGTGAGGCGTGGAGCTTTTCCAGGAGATGCGGGACACATTTAAAAATAAATGTCCGTGTCCCTAGATGGACACATAACGGATCCTTCAGCTGAAACATCTTCAAAAAGTCGTCGGGATTGAAGGCTGGATACAGCATGTAGCACCAGGGCAAGAAGTCCCCGGGTCCGGCGGCGTTCACCGAGGGTCCATTGTGCTCAACGCATCCGCTTTAGGTTTTTAAAATGAAAAGACTTGTTGAAAAGGCGTAGGAACAAAAATTTACAAAGTTTACAAATTTTACGTACCAAATGGAAACGCACCGGAACGTAGCAGAACTAAGTCCCGTGAATAGGAGAGGATCGAGGGTCTTGGTGGTTGGACATTCCGTAATGGCGATCTCAACTCCTGAGAGCTTCTCCGGACATTTTTCGAGCTTCTTCCGACAGTTCTGAACTGCCTCCTTGTACTCTTTTTGCTGGTCGGGATCCCCCTTTGTCCCCTTATCCTCCAAATCACCAAGTGCACTCAAAAATTCATGCACGCGACGGGAAGTGTCGTTCATTTTTTCATCGGTATGGGTCTCCGGAACTGGTTGGGGTACCGCTTTTTGACTTTTTTGAGCCAACCTTTTTTGCGCCATCGGGAAGTTCTTTCCTCTTCCGGCCATTGGTCAAAAGACAAAAACTAGAATTAAATTGGTGTTCAAAACCTTGGAGAATAAAAAAGAGATTTTTTGTTTTATCGAAAAGAGGGATTAAAGAAGGTATTTTTATATTTTATATTTGTGTATTTTTATATTTTGTTTTTGGCCGATAGACGGTCAACGGCGGACAGTCAACGGTGGCGTTTTTTTCCGTCAGTTTTTTAATCGACACCGGCCAACGGACGGTCAACGGACTGGCGACGTGCTTATCGGCTATATGGCTATAGCGGTAAATATAGTGGTATAGCCCATTTTTATCAAACACTAATTTTATTTTTAATTTTTTATTCTCCAAGTTTTGAACATCAATTTTAATTTTTTTTTTGAAAATGGATCCCCATTGGATCCAAGAGTGCGCTGGCAAGGATCTCTGTGCAATGATGGAAAGCCGCACCCTTTACGATAACGATTT
GGGAGCGGTCACACGCGGTCGACTGCACCCTGGGTGCAGTCGAAAAAGTGAAGAACTCTTGAATACCGAAAGTCGGTATATCTGCTTGACCGGGACTTGGGCAACCTTTCTCAAAAACGGCAGCCATCCAATCTGTCGTGGTTTGACAGATGAAAACACGGTCCAAAAACAAGAGGAACATATCCACAAGGCACACGAGCCCATTTACAAGCTGCTGAAGGTGATCCCCTTGGTGGCATGGATAAAAACGCGAAGCAACGACGCTGAAAAAAAGTACGGCCTTTTGTGGGATGTTGGGAATGGAACCGGCCGCGCCATCAATACGAGTAAAGATAAAAAAAAGAAAGGAAAGTACTGGCCACAGATTACCAACGGAAAGTACGTTGAACTGCCGCTTGAAGGCTTCAAAACTGCGATCTGGCACCTGTGGGAAACAACCGCAGAACGCGGGGTCAAGAAAGGGTTGAATCCCCCCAATGTGGAAGACCTTCGTAACTTCCTCCAAACCAACACTCACGTGACCGTAGTGCTTAAAAATTGGTCTCGGGACAGCGCCACAATGGAAGAACTTCTTGCTATCTTGAATCGCGAGGGGGAACAGTTCAAGGACAAGCAATGGGTAACTCAAAAATAATTTAAAAAAAAATGTAATCTCTCCTAATTATGAATTAAAATCCCTGTTGTAAACGTAGGTTCATTGAATTGTGTAATTCCTTTGTCGTTGCTTTTACTTTGTTGACATTCCTACCGATTGCATGTTAATGTGGTTTTTTATAGAATCAAAAAAAACGGCCGAATAAGAGGACGAGGCAGGTAGCGACATCGGAAGAGGGAAAAGGAAAAAAACAACGCCAAAAATAGAGGAAAAAAAAGTTTTGGGGAAGGGGGAGTATTTAACAATTATTCAATAAAAACTCTGAAATGGAACCATTAATTAATTACTGGAATTTGTAAACAATGAATCTATTATTGTCTTATCTTGCTCAATATTCACGAGACGATAATACTTGGTCGCAGTAGTATTTTATGCAGGATAACTTTGTGTTCTAGTTCGAGCACCGTAGTTTGCTGACC
>k141_20768437
ATTCAAGCTGAAACCGCCGGAGGAAATATCGACGCTAAGTCTACCATGTTCTTTTAATATAAAAAACTCCACCCACTCGTCGTTGGTTTCGTGCTCCTCCAATAAACGTTAAAACAAAATGATTGCACCTAACAGACACGCAGCGTCATGGTGAAGCGCGTTCTAATAAAGCCGATTAGACCGCTCCCCCCTCCGTGGAGTGCCCTGGAGGCAATCGTTGTGATCAATTTGCTCGATCGCGAAGATCGACTCGCGCACGTGCAGAAAGAACTAAAACTACACGGCTTGGACGGAGCCGCTTATATTCTTCGCAGCCAAAGAGAACAAAATGATTTTCTTAAAGGATGTTACGATTCCCATAGATATGCCACTACCCTTGGTCTTATCAAGGAATGGAACCGTGTACTAATATTGGAAGACGATTTTGTCTTAGATCAGAATTGCGGTCTCCGTATCGCAGAAAATGTGAACGTACTTCCGAAGAATTGGATGCGGTTGTTGGTTGGATATATTCCGATTGCGCCGTACTACGATTTTTCCTCGCAACTTTGGAAAGGCCTAACGTTGTGTTGTACCGGGTACGTTATCTCGAAGCAATATATGGATTGGATGCCCGTTTGGGAAGACGTGAGTACTATTTGTAAACCGTTTACCGTAAATTGTGTGAGTTTGAAAAAAGCCGACAATGGACTCGACCATGTTATGACTTATTTGACTCGCCGACGCACATATTTGGTTTTCCCGGCTGTCGTTTATGTCAACGGAAAGCTGAAGTCGGACCACACAGGCAATTTTTGGGACAGGTCATGTCACTCCCTGTCCAAGCAGAAGTTTTTACAATTTCTTTGGTTGATTGTGTATATCTTCGTATTGATTTCCGGGATAGTAATAATTGTACTTCTCAGGCGGGTTACCGCTTGACCGCCATACGAAACGTATTTTCGCACGGGAAGGCTCGGACGAGCGAAATCTTCCAAAATAATAAAAGTGAAAGTGAAATTGAAGTTGTGGTCATCTTGGTCTCGCATAGTACTTCAAATAAAGTAAAGTATGATCGGTGCGGTTCAGACGTACTGTAACCTTTTAGACCCGAATGTACAGGAAAGCGTTCTGGTCACTGCTACGGAAATTTTGGTTGCTGGAACGGTCCGGTCTTCCGACGCGCGCCTGCTCGCGGCAGTGTCCATTTTTTACTCGAGCCGTTACCACGACAGGTATCGAACTTTGGATATGATTGTTTCCGAGATGGCTGAGACTCCAGCAGACCACAAAACAACAACGAAAAGGGTTCGGAAACTTTTAGCAAGACTCGAAAAATTGTGCACCACCTCCCCGTTCCGGCTTACTTGTCGTACTCGACCTTCTTACTGGAAACCCCTCAGCGTGGACCTGGTCACGCATGCCTGCGTTAGACTCGGGTGGTCGGCTTCTGTTCGGAAAGCGGCTATCACGGTGTGCATGGAACTCCATCGATCTGACTTTGGCTTCTCTCTCTTACCTGTGTCAGCTGCAGCAGGCGTTCTGTATCTGATTGCGCTGCAGTTCGTCCCAGACACAGGTGCACGGGAGATTGCGTTCGTTCTTTGTGTTCCTACCGCGACAATAAAAGTTGTACATCGCCAACTCAAAATTCGAAGAGTGGCTCCCCCTATTTTGCTTTAAAATAGGGGATTTGACTTTCTGGAGTTCGCTTCTACGTGCTGTACTTCCGAAAAGGTCGAAAAAGATAAAAGATAAACAGGGCTTTCCCGTTTCGCTGAAAGTGGAAACATTTCTCAAAAAAACAAAGTAAGAGATGTTTTTTTTCAAGCAAACAGAAGTTGACATTTCAATTTATTTGCATTTCGATTTTCGTTCTTCCTTGAGACACCTTTCAAATCTCAAATTTCTTTTTTTTTGCGAAGACAAAAAGAAAAATTAAACTTCAAAAAAAAAACGAAACGCTTACGATGCACACCCCCGTTTTCGAGTCGGAGAAGCAATGTGTAACTTTACCGGAC
GGAACCGTGCTGTCGGACGCTTTCGTTGTTGAAGATTTCATGCAGAAATTTCTCTCCCTTGATCCCCGTCGTATTTTTCAAGACCTGCTGGAATTCGAGTGTCACGACACAAAAGTCGAGATCCCTTTTTCTCCTGAAGGAGACTCGGTGAAATGGGTGACGGGAGAGCATCCGGCTTTACACTATCGAGGGAACGCTCTGAAAAGGCGTAAAATGTGGTATCTCGGGATCTATCGTGGCTCGTTGCATAAAGACGGTGGTTCCTTGGCCAGTTGTTTGGGAGAACATCGCGAATGCGAAAAAAACTAAAGCGCAACGTTCGAAAAGAAAAATCGAAAGAGAGGAACAGGAAGAAAACGAAAAGAAAAAGCGTCGAAACTCGAAAAGCAAAAGCATGTATTAACATTACACATTTTGCTCACAAATGGTTTCTCTCCGGTGTTTATTAATTTGGAATACTAATTTTTCATTTTCTCTGCCCCTGACATTCATGCTTGGGGGCGACTGAGTTGAGACCGATTTCGGTCTCTTTTTTTTTGTCCGCATCCAATCTCATGTTTTTTTTCTTTTGGATTTTCTTTTTTGCTGGTTTCTTCGCCGCGGGCTTTTTGGACGAGGCCGTACTGGCAGAAATAGTAACTGTTTTAACGGGAGCACGAATCTGTGGACGAACGTGAACTCTTTTTCCGTTCGTGGAGCGAAAGTGCCCGCGGGCTTTATTTTTTTTCGTAGTGGTTTCGCGGCCAGTGAATTGTTTCGCACATGTCTCCCAGTCCGTTGTGCCCATGCTCGCGTTACATAGAGCACAGATTGCTCTCCCGGTTTTCCCCTCCTTTACATTTTGCGAGACGGTGCCCAACGTGAAAGTCCCAACGTGAAAGTCCCACACGGTAACCTCTGTCGAGCGACAGCACGGGCATTTTGAGCCGAACTCGTTGGACCATACCTTTCGGCGCAAGGCTTTAGAAATAGTCATTCTTTTGTGGTGGTCATATACAGGTGCGTACCCTACTCATTTTTGTTGTAGAAGACAATAAAGCGCTCTTTTTCTTTTATTTTTTATTTTTTCTTCGTGCTCTACGTAAAAAAAGCGTATATGGTCCGAACACTTTCGCAGAGCGGATGCCCCCTTGAAAGTCCTCTTAAAAAAGGATGGCTAGGGCTGCATGGACGCATCGACGAGCAGTGTGCCAGAAACTTCTGGTTAAGCCGTGTGCGCAGATCGTACCGGTCGTACAGAGATTTGGACCACCCGGAAAAAGCGCTGGAAAAAGCCCTAAAGGACAACGAATCGACGCTCTCCGCCCGGTGCGAGGGAACCTTGGACCTGGAGAAACTGTCCTGCAGCGGATATTGGTGGGACTATAAATGGAGTCCCTCATCCGGAGCAGTTTCTACGTCGCCGCAGTGGTCAGTTTCCGAATGGAAAGATATCCAAAAATTCGCCCCGGAGGGCCTTCTTGAACGTCTCGGAGCCGCGCGTTTGAAAGCAGAGGAAGAACAGAGAGTGCGACAAAAGGCCGAAGCCGAGCGTTTGATAGCAGAGGAAGAACAGAGAGTGCGACAAAAGGCCAAAGCCGCGCGTTTGAAAGCAGAGGAAGAACAGAGAGTGCGACAAGAGGCCAAAGCCGCGCGTTTGATAGCAGAGGAAGAACAGAGAGTACGACAAGAGGCCAAAGCCGCGCGTTTGAAAGCAGAGGAAGAACAGAGAGTGCGACAAGAGGCCGAAGCCGAGCGTTTGATAGCACTGTTCAAATTATCGGGAGGAGAAGAATCAATTCCTGGAGAAAAATATTGGGCTTGGCAATTCCTGGAGGCCACGGAAAAAGAAGACCCAAAACGCAAAGAAGAAGCAGACCCAAAACGCAAAGAAGAACGGGAACGTGAGGAAGAAAAACGGAAGCTCCGCTCCATCCGAAAAAGTCTGGCTCTGGAATACCTCTCGCCCGAAGTTCTGAAGAACGTGCTAGAGCGCCTCCCGAGAGACGACCCGTTGCACA
CAAAAGTGTCTGCCCGACTTAAGTCGGCAGAAGAAGACGCTCGCAGAACTCAGTCGTTCCAAGAGCGATTGCGAACTTGTGAATTCCTTTACTCGGAGTACGCAAAAACAGCCGCTTCTCCGACGTCACCTCAACAAAGAAAAAAACTCGACGAACGACTGAGTACCTTGCAGTGCGCAGAGATAGCAGAAACAGCGTGGAGGGACAAGTGCCGGGAGTGCCACCGGGCACCTGTACCTGCACCTGGAGAGTGCGACCAGTTTCCTCGGAACTGTAGGGTGCCGACGACGGCTCGACCGCTACGGTGGGCAGACGCCGATATTTTCCAAAAGGTAATAGAATGCGACTCGAAAGCGAGACTCACTAATACTCAACAGGAATTTTACGTAAAGCATCTCCGCCAGCTCGTCAAAGGTACGGCCGAAAACGTGGAGGACGCCCTTAAGAAATTCCGTCCCGATGTTTGCCTCTTGGGTCAGGGACGACAGGAAATCTACATCTTCATTTTAGCTTTATTTTTCACTTACGCTATTCTCCCTTCGCTGAAGGCCCGCCAGCTGTGTCTCATTCCCACTGTTGAGTTCGAAGTCTTCTCGGACAGCGAAATCGAAAACGTTTCACCTCAACTGTTAGATGAAGATCGCAAAGCTCTCGTGGCTTCCTGTGTGACGAGAGCCCAAGAAAAAATCAATCGATTTTTATATCCCGTACTTCCTAAGTGGAACTGGGCGAGTTCGATGACCAATTGCCTCAGCTTCGAGCTTCAAATACTTCTCGATTATGATAATGATCTCGACTTCTACTTCACTGAACAAAGGGACGTCATCGTGGTACCTCGGCTGTTCCTATATATCCGGAAGCAAAACGACAAATGGACGCCGGAGCCTCACGATTCTAGCTGGTTCGCGCGATGGGAAGCCAAGCGAAAGCTCGAACAGGAAATCGTGGACTCTCTGAAACCGTTAGCGCAAAAGAGTTTTAAGTCTCTCTTTGAAGAAACTAAATCTCTATTTCCTTCTTTTGCTCGTTACAACCTCGTAGTGATGCTCCCCAGACATGCGACCTCGGCCATTTTTCTGCCGGACAAGAGCGAGATTTGGTACGTCGATACCGGCACCCGGTCGTCGTGCGGACAGGCCTTAGGCCCGGCCACAATTGTCCGGAACGTCCTGCAGGTTTCTCCTGAATTCCGTTTTTTCGCTCCAGGGTGCAGCCACCTCGCCTACAGGCAACGAGGCCCTTCATGCGCTCTTCATTCTTTGTTCTTTTTCTGGTACGGGTGCTGTAACGACTTGGCCGCGGTGCAGAGATTGTGGAATTTTCAGTCGCTGTCAGACTGGAGAAAACACAAGGAGTTAGTGAACGCGCATCGCCGGGCTCCGTTTTTGTGCCCA
>k141_7723568
TCGGAACTTTAAAAAGAAGCTCGAAACCTGTTCGGAGAAGCTCTCCGGAGTGAAGATCACCATTACCGCATGTCCCACCACTACGAGCCTCAATCCTCTCATGTTCAGGGGACTGAGTTCTGCCACGGAGGGGTGCGTTTCCATTTGGTACGTAAAATTTGTAAACTTTGTGTAATTCCTTCTGTTCCAAATTTCCCTACGCCTTTTCAACAAGTCTTTTCATTTTAAAAACCTAAAGCGGATGCGTTGAAAACAATGGGCCCTCAGTGGTCTCCGTCCGACCCGGGGACTTCTTACTCTGGTGCTACATACTGGATCCAGCCTTCAATTCCGACGTCAATTTTTTGAAGGTGTTTCAGCTGAAGGATCCGTTATGTGTACGTCTCGGGACACGGACATTCATTTTTAGATGTGTGGTGCATCTCCTGGAAAAGCTCCACGCCTCACCGGCGTGCGTGAAAGCAGGCAACACCTGGAAGGTAAAGTGCAACGACGGGTGTGCACACACAGGCAAGGAGAAGAAAACCAGTATTTTGATCGATTCGTACATTGCTTCGGAACCTAAGACGACCACGACCGAAAAGAAGTACGCAACTGAAGGTTGGGCTGGAGTCTCTTGCGACTGCGAAAATAAGGTAAACTTTTGAAAACAACTTTGAAAACAACTTTGAAAATATGCCAAAACCGAGCAACTTATAAAAAAAAGACCTTCTCGACGAAGCAAAAAGCAGAGGAGCACATGAGAACTGAGCACAAGGAAATAACCGTCCCAGAAGATGCCAAAGTGATGCAGTTTAAAGACTTCGGAGCTCGGGGGGCGAATTTCACTAAAAAAGTCTTGAACGCACAGGTCCAACGAATGCTTCCGGCTGCAGGCTGCACAACCGACAAAGAAAGGGCTGCTGTGCAGGAAAATGCGATGCAAACAGTTAAGGAACGCTACGCAAAAAACCAAAAGACCGTGGCATGGGACGTCAACAAGGGCAAAATTGGCAAAATGAACTACACAGAAGATGAAGATGGCCGTTCATACATTACCAAGAACCAAGACATCCACGTTCTGGACATCGCCGAGATGGACGCGGCAAAGAACCTCGAATCCGGTCCGACCCGGGCCAGTAGTTCGAGCCGGGCCAGTAGTTCGAGCTCTATGGTGAGCTTCCCCCTGGGATTTGTCGCGTATGGAACGGGAAGATCAATGAAGGCGGTGTTTAGGGAGTTTTGCGGGGGAAAAATATCGCAGAAGCAATTGGAGTACTTACACTTTCCTCCCGAGGAAGACAAAATTGACGAGGTCGAAGACATAGTCGAAGACAACGAAGACACCGAAGACACCGAAGACAACGGAACGAAGAAGAAGAAAACGAAGACGAAGACGAAGAAGCGGAATCGGGACCCCGATCGGGAGGACACCAAACCGAAGAAGAAGAAGAAGAAGAAGACGAAGTTAAAAAAAATGGAGTACACACCGAGTACCCACCACCCTATGCCTTGGACGGAAGCGGCGGAGGAGGATGCCCGAGGACAAAGACTCGTTGCCCGCGAAATCGGAGGACAAAAAGTCCGCGTTGAGACACCACAAAAGGATTGAGGATTTAAAATTATGTTTCTAATAAAAAACCGTTATTTGTATTTTCTTCACAATTAGTTTTTTTTCTCTTCTCTGACCATATCCAAAATTAGAGAGTTCCAAAACTTCTAAAAAAATGAATCGCTTGTTTTCCCGGAATAAACTCCTGAAAACTGAATTCAGAGTCTTCCCGTAGAACCAATACAATGAGCATGGACTCTGAGACCCCTATAAACGACGAAGCGGTATGGCACGTCGCAGAAGCGTTTTTCAAGAAGTTCGGTCTGGTGTATCACCAAAAGGAGAGTTTCAATTCTTTTTTTCTTCGCTCAATTCCTGACATAATTCACGACAACATGCCAATTACGTTCGGGAACGGCCGTTATGCAGTTGAAATGCAAAACCCTCTCTTCCATGCCCCGTGTGTCGA
AGGTGAAGGCACCGTTGTCTACCCGATGCAATGTATAGAAGCAAATCGTACTTACCGTTCTGAGCTTTCGGTAGACTTAATTGTACGGGACTTGGCAGACGGGTTAGAGAAGAGTCACCGGGCGGTTTCACTAGGGCTTTTTCCTGTAATGGTAGGGTCCGTCTTCTGTAACCTCGTGCAAAGAAATACAACAGAAAAACAGAAGTACGCTCTGAGAGAATGTCCGTATGACGAAGGAGGGTACTTCATCGTCAAAGGAACCTGTAAAGTCCTGGTCTGCCAAGATCGTCCCATGTCATGTTACAATCGCGTTTATGTGTTCAGATCACGTAAATCTCCGAACTATGCTTATTACGCGGAAGTCAGAAGCATCGCACCCGGCCGAGCCGGCCGAAGTACCACCGTAGTAGTGGGCCTTACAGAGAAGAAGAGCAACGTTCGACGTCTAACACTCTCAGCGGTCATTCCGTATATGTCGGACAAAACCCCGATTCCTCTCGGAGTCCTTTTCAAAGCGTTAGGAACTAAAGACGAACAGGAGATAGTACGAACGATTTTTACGAACGAAGAGCCGTCAGCAGCCGCGTTAGCCTTTCTGCGAGGAACTCTTGAGCAATCGTACGGCTGTGCGACCCGAGAAGAAGCCCTGACTCGTATTGGTAAAAACGGAAAACGTCACTTTTCGGCAAAAAAAAGTCCGGAAGGAGATCGTCCGCTAAGTTCAGCTCAGGCCGAAGCCACCATCCGCAATCAACTATTTTTGCATATTCAGCGCTCGACTGAGAAGAAAACTTGGGAAGCGAAACGCTTCTTCTTGGGGTACGTCGTCAAGCGGCTTATCAATGTAGCTTTAGGAGTCGAGAAACCTGACGACAAAGACCACTACGCGACGAAGCGCGCCACTACGCCAGGGATGTTGCTGGAGCGACAGTTTTCGCGGGACTTTCGCCGTCTTTGTAGCGACTTGGTAAAAGCGGGGGAAACGGCCTTGGAAAGGAAAAACACCATCGATGTCAAGACCTGGGTGAAGAATAAAGCGATGAATATTACTTCGTCGATGAACTATTGTATAACAATGGGAATGTTTGCTGGCAAGATGATTGGAGTCAGTCAGAATTACGATCGTTTCAACTTAATCGCTTCGGTGGCTAATGCGCGCAAGATTTCTACGCCCATCAACGAAAGCGGTAAAGTCGTTGGTCCGCGACAGCTGCATGGAAGTCACTGGGGTATATGCTGCCCGTACGCTACGCCGGAAGGAAAGAAAGCGGGTCTTCTTAAAGATCTAGCCCTTACTTGTCGAATCACGGTAGGTGAGAGTGCGGAAGGATTGAAGGAACTTTTACGTCTCGACCCGGAGCTAATAGACCTAATCATGCCGTCTCGAAGACACGGCCACAGCAAAGTCTTTGTCAACAGTGATTGGTGGGGGTGGACGCGGGATGGGTCGGCGATGGCGAAACGGTACCGAGCGCTACGCCGGAAAGCTGGACTGAGTCCCCTGACTGGCATCTCTTACCATCCGCTCAGAAACGAGGTGCGATTTTCAACGGACCCTGGACGGTTCTGTCGTCCGCTCTTCGTTGTTGAAAACGGACAACTCCTTTATTCGACGAAGCACCTTTCTATTGTTCAAACTGAGGGGTGGGATGCTATAATGGATGAGGGGATCGTGGAATTCGTGGACAAGGAAGAAGAAGAGTTTCTAGTGGTGCAGTATTCGCCTTCGTCACTCGCTCGTCTTCGAGCCGACGAACAACAGGTCGTAACGCACTGCGAGATTCATCCTTCCCTTATTTTTAGTGCCAGTGCGTCTGTGATACCGTTTCCTGATCGTAACCAGGCTCCTCGCAATTCTTACGCAGCTCAAATGTCCAAACAAGCTGTCGGGATTCCTGGCTTGAACTACTCATTTCTTGTGAAAGGTACCTACAACGTCCTAGACATTCCGCAACGGCCTCTCGTGGAAACGAAGGTTGCTTCGTTGCTGGGTTTCTCTA
ACCTTCCCGCTGGAGTGAATGTAGTGCTCGCTGTGTGTTCATTCATGGGATATAACCAAGAAGACTCTCTAGTTTTCAACCGGGCCTCTCTCGACCGGGGATTGTTTGGGATCACCCGCTTGCTGACCTTCTATGCAGAGGTAAAGAAAACCGAAGGAGAAGAGTTTGCTGTGCCTGAAAAGCTACAGGTTTCAGAGGGAACCATCGTCGGACGAACCGCTGCTGGGCAAAGACGGACGAGCAGCGACGTTACCCAACAGAAGGGGACGAAAGCTCTTTTGTTTCGCCCTTGCTGTAAAATCACGGGCAACGCTGCTAAACTAGATCCTCAGCTGTGCCACGTTCTTGCTCGAAAAACCATAGCTTGTCGGCTTTCAAAAGGCGGCTCAGTCGTTACACGAAGTACGCTCGTCGAAAAAGGAGATATTTTGATAGGGAGAATTACGAAAAATGCACCTGGAACAATTTACCCCGAACCTTATAGAGACGTAAGTATAGTTTATACAGAGACGCTTCCGGGCCATGTTCATCGAGCCGAACGTGGAGTGAACGCGTCTGGATACGAATTCATCCGGGTGGTGATCTCTCAGAAAAGAGGCGCCGCAATAGGAGATAAATTTGCTGCGATGCACGCCCAAAAAGGTACTCTCGGGAAAATCGTGGACCCGGAAGACCTGCCTTTCTGTGCTTCCGACGGCATCATTCCGGATGTGTGCATTAATCCTTTAGCTTTCCCTAGTCGGATGACAGTTGCAATGTTTGTGGAATCTTTAGTCGGGAAACAAGTTGCTTTGTCCCCCAAAGCCCGCAAAGTAGGAGCTCACGAACTTTTTATCGGAGATGGAACGCCTTTCGAACGGCTCGATCTTCAAGAAGTTGAAGCCGTCCTCACCAAAAATGGGTATCAATGCCGCGGAAAAGAGTTCATGATTGACGGAATGAGCGGTCGGCCGTTGCCTTGTAGAGTTTTTATCGGACCAGTTTACTATCAACGGCTCAAACACATGGTTGTAGATAAAATTCATGCCCGGGCTAGAGGGAGCCACACATCCATCACTCGCCAGCCAAAAGAAGGCCGACAATTTGGGGGAGGATTTCGAGTGGGCTACATGGAAAGAGATAATTTGGCTGGCCAGGGGTCGGCAGCTTTTCTTCGCGATCGTCTTTTGGAGAATTCGGACGACTACAAGATGTACTTTTGCTCAAAATGCGGTTTACCTGCCGTTATGTCACGGACAGGACAAGGCGAATGCACCCTGTGTAAATCTCGAGACGTGAAAAAGGTCAGGCTCCCGTACGCAACAAAATTACTTCTACAAGAGCTGAACGGAATGGGGGTGATGGTTCGCGTAGTGCCATCAACTTTTGGGACCGAACACCCTGAAATTGAACCTTACCAGGGACCTTCCTAAGCGCCCTTTGCAATGGGACTGAACTGCTGGAAAGTCGACAACGAGGCACCGGGGTTTTGCTCGACAATAAAAAAAAGAAAAAAAAGTTGCGGAGCTCAAAGCGCAGCTTGCGCCCATACGTCTACTCTCCGGTTACGAGATTTCGGTTGTTGAACAATAATGCTTTATTTTTACGAACAAACGTCCCTTCATTTAAAGAGAAGGACATTCATAGTAATGACCACATGCCTAAGGTGTGTGAGTACTTGAATTGCAGAAAACGTCCATCCTACGGTTATTTTTACGGAGAACCTAAAAGGTGTTCTACTCACGGGCTTCTTTGTAAGATGAAGCCTCAATATGCTATCTGCCGGTGCGGGAAAGCTCAGCCTATTTACAACGAACCGGGAAAAACACGAGCCGTGTGCTGTTCCCGATGCAAAACCGCATCGATGGTTAATGTCAAACACAAAAAATGCAAATGCGGGAAAGCTCAGCCTATTTACAACGAACCCGGACAAACACGACCCGTGTGCTGTTCCCGATGCAAAACCGGGTCGATGGTTGATGTCAAAAACAAAAAATGCAAATGCGGGAAAGCATTTCCTA
TTTACAACGAACCGGGAGAAACACGAGCCGTGTGCTGTTCCCGATGCAAAACCGAATCGATGGTTGATGTCAAAAACAAAAAATGCAGATGCGGGAAAGCTTGGCCTATTTTCAACGAACCGGGAAAAACACGAGGCGTGTGCTGTTCCCGATGCAGAACCGAGTCGATGGTTGATGTGGCAAACAAAACGTGTCCGGGTCAGGGACCGGGAATGTGCCCTACAGTAGGGAACCCAAAATACAAAGGCTATTGTACACATTGTTTTAGTCACTTATTCCCCACTGACCCCCTAACTTTTCAAATCCGCTCTAAGACAAAAGAAATTGCTGTGCGCGATTTCATCAATTCGGTATTTGAGGGTTTTACGCATGACAAACCGCTATGGACTGGATATTGTGACTGCACCCATAGACGAAGAATTGACCACCGGAAACTAATCGGAAATACGATGTTAGCTATAGAAACAGATGAACATCAACACAAATCTTATAAGAAAATGGACGAAGAAACTCGGTATAACGATTTATTTATGGCTTTTTCAGGGAAATGGATCTACATTCGATTCAACCCGGACAAGTATATTTCCAAGTCAGGAAAAAATAAAAATCCGTGTATTGCCACCCGTCTGGAAAAATTAAAATTGGAAATCGAAAAACAAATAGCACGCGTTGAAAAGGGTGAAAATGTTGAATTAGTAGAGAGGATATACATGTATTACGGAAGATACAATTAACCGACCATTTCAGAATATACAGTATACAGATTGTGGATATAGGATAACAGAAGGACCGAATAAACTGTCCTCGACTCCGAATTGATTCATGTTGGTGTGGGAAAAGTGTCACTTCGCGCTCATTTTTTTCCTCTTCCCTTTTGGTTTTTTACCGGGCCTCTTTTTCGCTGCCATCTTCCAGCGAGTCGCTGGTGTATACCTATCTATTTTAAGATCAATCATCTTGTCTGGTCCGGGCTTGGGCCAACTATACTTAACGACCCGGTTCAATAAATTGCTATTCCTATTGACAATGCGAAATCCTACAGCAGGCGGATCCCAGCTAGACCAGAGATCGCAAGACGATTGTACATGATCCTTATTTCCTTTTGCGTACAAGCCAGTGTCATGCTTACGACTTTTCGGGTTGGCGTGCTCCACAAACCACGCTTTGGCAGTCGTTTTTCTGTATCCACTAACGTCAATTCCCCGTTTGCTCATAAAAAACGTATCCATCGCCACTGGCTCGATTTTACCCTTGGTGACCGTCCACATTTCGTAGTATTTGTCATATCGGTGAGCTTTGCCTGACCCTATTTTTTGTGTTATTTGTACATACTGAAACAAAAACCCAGTTGCGGACGGTGTAACCTTAAACCGTGCCTGAGTCCCTGCACGCCATTTTGTTTGCTTTATTTCTCTATCGACCATCGTTACTGTGAACTTCATGTTGTTTTGAGTAACCCACTATAAGTTTTCTGTGCTGATCAGTCGTCAATTCCGAATCAAAATGTCTGAGCGCAAGGTTCAAGGAAGACCCT
>k141_11312800
TCTTGGAGTTTTTCCTGTTTTACAACAACTGCTTTTTCCTGTTTTTGGAGTTTTTCAACGTCTGCCACTAAAATTGTGTTCTCTGCAATTTTCTGTGCAATTGGCCCTACGGGGGCGGCCGGTGGCGCGGCCCGATTACAGGTCTTGTCATTTTCATTATAGAGGCATTCTTTTTCCGGTACGTGTGCACATTTCTCAGGCTGCTTATCAAGGAGTCGACAATCGTAGTCCGCCCCACCCCCGCGAGACTGCAAGTCCAAGTGGTCCAAGTGGTCCAAGTGGTCCAAGTGTCCCATGTGGTCCTCCCGGTGACGCAGACGGTGCATTTTGTCCATATGGCGCATGTGGCTATCGATTGACCTTCTCCGCTCGGTCAGGTGACTTCTTCGGTCGCGTGTGCGGTGAGAACTTGACGTTGTGTTTCCCATTTTTCACTTACCTAACATTTTTTTTTTATCAATTTTACGTGTTCGACTGCCACATTTAAAATTTGTATTGTGTTGGGCAGTGACACACAGTGAATGAAGTTTCAAACCCGCCTTCTAGTGGCAGTCTTTATTACCGCAAGCTTAACAGTGTGTATGGTTGCGTGGATCGTGTCCGGACGCCATCCTGTGTTGCAAGATTCTTTTTATCCTCCACTGAATCCCCCGCTTCACCTTCCTCGACAGTTCGGCTTGGCTTACGAGCGGACCCCTCAACGTCTCGACCTTGTGGTTGTACCAGCAGACCCGCTGGGACCTCAATACCCTGAAGCGGCCACCTCTTATTTTTACGTATGAACAGACGCATTTTCCATTTTTAAGATTATTTATTCTTTACGAGAAGATCCAAAGAACGCCCCGTACTTTAGTGCGGGGAGGATGCCATAGGCCAAGACGAAACAGCAACACTTCCGGAATACACTCCCCCGTGCACCACGGGAGCACCAAGAAACTTCTGAAAATGTCCCTTGAGATGAAGGGTCAAGCCGTCCAAGATCTCGAGCGGCACGGGCGACCGGGATTTTCCTCGAACCCACAACTTTAACACGGGCGAGTCGTGGGATAACAATACACCTATAAGTAACGAAGCAGTCGCGTTCTTGCTTTCAAGAAACTCAGCGATCCATGTTCGGAGAAGCGGTGTGTGGATACATTTATGTCGCAGAACGCACCAGACGCCCTCCCGGGAGAGCATTTCATCGGGACTCAATTGAAAACGCAATGTGATGCAGTAGGACCCTGACAAAGAGTCTGTTTTTTTCATCATGCGCGACCGCGCAACAAGCAACCATTGGTATTCAATCGTTTGGACTTAGACTTTTGAATTCATTTTCCATTGTTTTATGACATCCTCCCCGCACTAAAGTACGGGGCGTTCTTTGGATTTTCTTGTAAAAAAAAAACGAGTTTCGTCACGCGTAGCCGTCGGTCGGAGAGCAACTCACGGAGCAGGAACCGCCGCGCTGTCAGTTTTTGAACAAAATTTGGATCTATCAACGGCCTTTTTCCCGCCAAGAAACTCAACCCGGAGATCGTACGGCGACAACCATTCATCCCATTTCGAGTGCCAGCCGTTGAAATGTACCCTTATCAAGACAGCGTCCCAATCCGTGTCGACGACGGTGGCCACCTTTTTTTTTTAGTTCTTTTGAGTTTGTTGTGCCGAGGGAGGGAGAAAGATACGTTTGTTTACCAGCCAGTGAAAAACAGTGTCTTCCACATCAAGCTCATCTCCAACTTTCAGTTCGGATCGCCATTTCTCTCTCTTTCTCTTTCTTTCTTTATCTTGGTCGGAGTCCGCGCGCCCTCCCTGCCATCTCTGCGCCAGCGGTTCGTTGACTGCTTTCCCGGCGCGTGCGGGAACAATATTCCGAACAGCGACGTCTACTGCACTTGGGTCCACCAAGGACTGGACTTTGTTTGGAATAGTGGGGTAACTCTTTCCCGATCTTGTCCGCATTACGTAGAGATCGCTGTCAAAACTTTCTGCTGCATTTTGGTCGACCGAGGACCGCG
GCTCGGCAACATTTGCTCGGATGAAGCGTTTATACAGCGAGCTGGCATAGCTTTTTCCCGAACGTGTACGGATCCCGGAGAAATCGTCTTGCCCGTTTAAGAGAGCTCTTTCCTTCTCATTCAAACCTTGCGCGTCGTTGGTCATAGTCACTGAGTTCCGTTCGAAGATGAAAAAAAAAATATGTGTGATTCAATTTTGATGGGTTACGCAGTATTCCTCTCGTTCTTCTCGTTCCTTTCGTGGCAGAGCCAGTTGTCGTGCTTCGTTCGTTTGAAAGAGCTCGGGTCAACCCTCGAAGTCCCCGCCTTTTTTTTATTAATCGGTTAATCGTACAAGTTAATCAATTCTTTTGCGTACCAGGTTTCCGTGTCGCCATTGGGTAACTTCCGGATTATTTTGTTAGGAATCTTTTTCTCGATTAGTTCCTGAAGAGCTAAATCTAACGGACTTCGTCCTCCAAAAGGATCATCCACCAGAGGAACTTGTCCTACACTTATCTGCTCTGCCCTTACACCAAGGACTTTGGCTCTTTCATATCGGGTTAGGTACGGAGGAGTTCGTCGAGTAAATCGGTCTTCGCTCATTGGCAGCCTTTTCCTCGCCCATGCTCTTTTCTTTATCTATTTTTATAACTCTATTCATTTTTTTTGAACCGACGGAACTGTAGCCGGTAAGTTCGAAAAAGTTTCTCCTATACGCTCGGTGCCTTCACCGTGCTAAAGCACGGGGATTTGACTTTTCTTTATCCTTTCCAGCGTTCAGGTTAACGTGCTTTGCATCTGAAAAGGATAAAGAAAGACAATTCAGGGCTGGCAGCGGGGGTGTACCTCCCTGGCTCGCTCTACCTCTTCTTTTTTGCTTTTATGCAAAGCACCTAAGAGTATCGGTATCTTTCCTCGGACCGACTCGAGAGGAGACAGCGCCCAATTCCGCACAAACGGAACAATCCGGGCCACCGGTAGTTCTGTCATCCACTTAAGGGTGCCGGCGGCAATGAAGGAGGTCATTTATACTAGCCTCTCAAAATTTACTTCAGATGGTAGGTTCGACACAATTCTTTTCATTATTTTGTTCTATATAAGGGTATCCGGTTTCCCCAAAAGCTCACCCAGTGAAAAAAGTCTGTAGGACGCATTATTATTTTTGTTTCCTCTTTACGCATTCCAGATGGGTTAAAAAGAAGGGACTCTATGGACATAAGCCTAGTATGGTGGATTCCGAGGGCGAAATCGAGGTCTGGGTGTCCCTTTCCGTTCGGGGGAAGCTCGTGAGCTTCAGTTCGCTCGGTCGATTCCAAGACTCTTTCGGAAGCAAAAAAGAAGTTGCCCCAGCCGAGAGCAAATACTGTCGTGTGGGGGTGGGGAATAAAACGTTTCAATTCCACGACTTAGTGTGCACGGCGTTCCACGGAGAAAAACCATCGGCGGACCACGAGGCTCACCACCTGGACCATAAGCCCGAGAACAACCGGCCCGATAACTTGTGTTGGCTGACCCACGAACGAAACACCCAAGAAAGTCACCGCACCCAAACGCGAAAATCGAGTGGTCCGCAACGAAGCCGACAAATCCTGGGACGGAAGCACAAGTCCACGAAGGAATGGGTCCCGTACGCGAGCATGAAGGCGGCAGCCACGAAACTCAAGTTAGACGTCGGACCCATCAGCGCCGTCGCGAGGGGACGCTGCCGGCAGACGGGCGGGTACGAATTTAAATTCGCCGAACAACCCGACTTACCCAGGGAAGTTTGGAAATCCCTGACGGTGAACACCAAGAAAATACAGGTGAGTTCCCTGGGCCGATACATGGACTCACGAGGATTGAAAAAGTCGCCGGTGCCTAGCCGCTCCGGGTATTGCCGCGTCATGATCAACCGGAAAAATTACTTTGTCCACCGACTGGTGTGTGAGGCGTTCTGGGGCCCTCCCTCGGATTAGAGGTCAACCACAAAGACGGAAACAAATCCAACAATCATTATATGAACTTGGAATGGGTGACGA
GTCGTTACAACATCTTACACAGTTACAGTACCAACAAAAACCGCCGTTCGAGTGCCGGGAAACTGAGCAAACCGGTGTACGGCCGGAAGCACAAAACAGACGACGAATGGGTGGAGTACCCGAGCATGAGGGCGGCAGCGCGACAGTTGGACCTGAAACCAGGCCCAATCTCCGCCGTCACCAAAGGAAAAAGAAAACAAACAGGAGGCTACGAATTCAAGCTGAAACCGCCGGAGGAAATATCGACGCTAAGTCTACCATGTTCTTTTAATATAAAAAACTCCACCCACTCGTCGTTGGTTTCGTGCTCCTCCAATAAACGTTAAAACAAAATGATTGCACCTAACAGACACGC
>k141_10144865
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTT
ACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAAT
>k141_22888658
GGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCT
CGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
>k141_18538400
CATTCTTTGTTCTTTTTCTGGTACGGGTGCTGTAACGACTTGGCCGCGGTGCAGAGATTGTGGAATTTTCAGTCGCTGTCAGACTGGAGAAAACACAAGGAGTTAGTGAACGCGCATCGCCGGGCTCCGTTTTTGTGCCCATCTACCCCGGGGGACACCGTGATCACACAGAAGATCGCTGCGCTCCAGATGAAGAAGGACCCGAATTTATTATTTCGGTCTACGCCAAGGGATCCCCGCACAAAAGAGAAGACCGTAACGCTAAAGGGGACTAAGCGGTGGTGGACTCTGTTGGATTATTACGACTGGGAGATTTCGAAACTCGACCCCTTAAAAGAAACGAAAGTTCCTCTCGGCCCATTGGAGAAAGAACGCCTGCAGACGTTATCAGAGCTTAGGAGGGAGGCATTCACTACCGGCCCGCGGATATCCAGCCTGCGCCTGACTAAGGACGAAGAGAAAAGATTTCCCAACGCAGCGAAAAAGTTAACTCGAAGACGCAATTTTGAAAACCGAGAACGAAGGGCGAAAACCGGGACTGACAGTTTAATTACTTTCCAAGTCTCCAAGGTTTACAATCAGTATGTAGAAGTTCTAGTGTCGGAAATCGACGGGCTGAGTCGTTTTCTCTTCGATCGTTGCCTAGAGGATTGTACAACAGAAGAGGAAACGATTGACGAGAAACGGCTAATAAATGTGATCTCTCGGTGCAACTTTTCATACCTGGAGTTATCGGATCTTCTATCGGATATATTCCCGGTAGTCGAGTGGATGAAAAAGAACTTCGGGATCGAACTCGGACGACCGGGCTCGGTCGTGTGAGTGGGCATCTTTTTTCCGTTTTTCCGTTTTTCGTTTTTCTTTAGAAAGCATAAGGAACGATGACTACAACAGCCCCCTTCGCAATTGACTTCTCTTCGGGTCACCGGTGCACTTGTATGAACCCGTCAGGTGTTCCGGACGTGGAACACTGTCTCACGGCTATAGTGTTCACCGTTCCGTCCCAGCTGTACAAAGTTCGCTCGTTGGCTGTGTTGTGGAGAATTTGGGAATTGTACAATGGGGGCCAACTAGATGTTCGTCTTCCCGACGGAACTCAGATCTCTTTGCCGGTTGAAAACGGCCCTATCAATCCGCCTGACCACCCGGGTTCCCCGCTTACAGCTGATTTTTCCAAGCTTTACGGGTTGCCCGGGAACTCCTACATGGAACTGGTGCCTGCACTTCAGCTCAAGCAAGGGGACCTTATTTTAGGGTACGCTCCGACGTCATCTTTAGAACGAAATCTCAACGTGGGTTTGCTAGTCCAATGTTCCGGGTTTCTAGACCCCACCGCGCAAACCGTCCAAGACACCGTTCCGGCTCTTTCTGTGCCATTGCCCGTTACTTTTAGACGGGGGAACCTCTAGTTCGAACTTTTAATTTCACGAAGTTCATTTAAAGAATAGCGGTACTATAATTATGTTACGATTGTAGCGGGAGGAGTGTTCAAAAACTTTCATCCGCTGTCGGCAGCCTACTAACTCCGTTACCACCGAGTGGAGAGACAAGCAAAACGAAATTTTTGTGAAGAACACCGGTCTAGACCGGAAGCTTCGCGGATGGTAAAAGACTTAAACTTCTTTTCTTCTACAAAAAAGAAGAAAAACTTAATTTGATTGTTTGTTATTTGTCCAAGAGACTGATTCTGTCGCGGTTTATTTTATTTTGTTTTTTTCAGTTTTTTTCATTGTTTTTATGAAAACTTCCCAAGTTTTTCTTGGGACATTCTTTTATGGCCTTCAGCCTAGATCTAAAAAAATAAAATTTTGTAGTACGTTTTTTGTTTTTCGGGGAAAAACAAAATGTGGGAAATGACTGTAGGTAAGGCCTGAGCGTTGTCATCTTCTAGGAGCTGCTGCGCCGTGTGGTCCACATACAGCACGCCCCCGTGACGACTACAGCGGCGGCTGGCAACCTCCTGATAAGAAGGGTCGTTTTCCGTTAGGTCCCTAATAGTA
CGGTAGTACACTTTTTTTTTATTGTTAATAGAAAAGAACAGGCCAAAAGGTCCTGTCATCTGAATAAAGCGGAATTAGTTAAGCACTTAGTGGCCACCGGGACAGGCAAACGTAACTGGGAACAGTGTAACGGAAACGCTTCTCGATCAAGAATTGGTATTTTTTATTTTTTCCTAATATGAGATAGATTACGCGGAGCAAACGTCGCAAGAGTCAGGTTTAGCTGGTGGGCCTAAGGCATGATTCACCGGATTCGTCAACGCGCTCTGTCGGAGATAATACATCCCCGTTTTCAGGCTATTTTCCCACCCGTAAAAATGGTACGAAGTGATTTGCTGGAAAGTAACGTCGGGCCTGAACCAAGCATTCAAACTCTGGCTTTGATCAACAAAGGCGTTTCTATCAACCGCGAGATCCAGCAAGTGTTTCTGAGGAATTTCGTAGGCCGTTTTATACTTCTTTTTTAGAAAGGTCAACCGAGCCGCCTGTTTCGGATCAGACGGAAGCCGTAAGTTCTGGACACTGCCATTCTGGTTAAGAATCTGGCGACGAACATCGGGCGTCCACATTGATAAAGCGGCTAGATCTCGGACTAGGTGCCGATTTACTAGCACGAACGTACCGCTTAAAACCGAACGAGTAAAAATCATCTGCGTGAACGGTTCAAAAGACTCGTTATTCCGGAGAATCTGTGCAGTGCTTGCCGTGGGCATTAACGCGATCAGCAACGAGTTCCGAAGTCCGGACCTGCGCATTCTTTCTCTCAAGATGTCCCACTCTTCTTCGGAATATTTGGCGGATGTTAAGGCACTGGGGCGTTTCAATAGATCGAAGTGCAATAGCCCTTTCTCGGAAAGACTTCCTGGAAACGACTCGAAGTGCCCGAACTTCTCCGCGAGATTAATGCTTTCTTCAACAGCGGCGTAGTACATCACTTCGAAAATTTGCCCGTTAAGAGTTCTAGCTTCTTCACCTTCCCAGGATAAATCCAATAACGCAAAAGCATCAGCTAGTCCTTGAACGCCGATACCAATCGGACGGTTCTTTAGGTTTGCATATCGAATCTCCGGTATGCCGTTGGGATAGTAAGTACGATCGATTGCTTGGTTCAGATTTTGCACCAGCTCGGCGACGAGGGTTCGTAGAACGTCAAAGTCGAAAACATCTTTTCCTTCTCGGGAGCGAAGGCACTTCGGCAAGCAGACGGCAGCTAAGTTGCACGAGGCTATTTCTTCCTGAGATGAGAACTCCACGATCTCGACACAGAGATTGGAGCACGGAATTGTGCCTAGATGCTGGTGATTGGAAGTTCTGTTGCACGCATCTTTGTAGAGCATGAAGGGCATCCCAGTCTCTTTCTGCGTGATGACGATCTGCTGCCATAACACTCTAGCACTTATTTTTGTTGCATTAGGGAAGTCCCTCTCGTAACGCCGGTAGAGAGATTCGAACTCGGCGCCCCAAACCTTTCCGAGGCCCGGAGCGTTTTTAGGACAAAAGAGCGTCCAAGAGTCGTCGTTTTTTACTCGCTTCATAAACAAGTCCGACACCCACAAGGCTTGGAACAGGTCTCGGGCGCGGAGGTCCTCGGGCCCCGTATTTTTGCGCAACTCCAAAAATTCTTGGATATCGACGTGCCAAGGCGGAAGGTACATAGTACCGCTTCCTTTCCGTCTGCCGCCTTGGTCGACGCTTCGGAGGATCTCTTGTTTTATTTTAAGCCAGTTCACGATTCCTTTGGAGCGTCCGAAATGTCGGATCGTGGAATGCCTAATGTTCGAGTAGTCGCACCCGATTCCGCCTGTGTTCTTGGAGATCACCGCACAATCGTGCCACGATTTTGTTAGTCCGGCCATCGAGTCGTCGATGGTCATGAGAAAACACGAACTCAACTGGGGTCGATCGGTTCCCGCGTTGTAGGCCGTGGGAGACGCGTGAGAGTACATCCCTGTGCTCAAGCGGTCGTACATTTTCTGGATTTGGGGGAGATTCGGGTACCA
AATAAAAACTGCCATCCTAAGGTACATGTATTGCGGCGTTTCAAGATACACCGGGTCCGCCCGATCCGGAACCACCTTTCGGAGGAGGTAAGACTTAAAAAGCGTTGCGAAACCGAAAAGATCAAATTGCAAATCTCTCTCCGGGTGAAGCATCTGCTCCAAAACTTCTTGGTTTCGAGTAACAAATTTTTGATAACCGGGATCGAACATTTCTGGGAATTCGGAGACGATTTCTGCGAAAGAAAATTTTACTTTTTGTTTTAAAGCCCAGATTTGAATACGCCCCGCAAGAAGCGACCAGTCCGGGTGATCTAAATTGAGATCGGCGCAGACCTTAGCTAATTCTTCCGTATAGTCACAGATAGGAACTTGCGCATTTTCTTCCAGAACGCGGTCCAACCTTGTTTGGTCAACCTTAAGACCGGAGGCCAACACAAGCACTTCTGCGTTAGAGATGCGAGTCATCGTTTTACTGGCAAGAGACGGAAGATCGATTATTCTTATTAGGATTCAGTTTCTTAAATGCTCTTCACGGGCTTTGAGAAAAAACTCGCAAAGCGACTTGAGGTCTCCCTTTTGTTTTTTTTTATTTTTCATTTATACAACGTTTCTTAGTTTTCAGGGGAGGGACTGGCATCCTCGTTCTGGGTTCTGCGTGAAGCCCGATCAACAATAAGGAGTATAAAACACTCATCTCCCTTCCAGTGACGACGCCGGGTTCGCAACGGCAGCTCCCTGTTCCCATAATGTGAATTCCCGGTTGTGCATGAAGTACATAAATAAAAAGTGCTCGGTAAGTTAAAAAAATAGGTGGCTCCCAACGGAAAGTATTCGAGCTTTGAAGTACTTTTTACGTTTGCCAGCGAGTCCACAATCAGCGCCAACTGAAACAGCTGTGGACACGTTAAACATGCAAAACGTACATTTGCATAGGTAATCTCTTGAATGGGAACCACCTATTTTTTTAACTTACCGAGCACTTTTTATTTATGTACTTCATGCACAACCGGGAATTCACATTATGGGACGTAGGTACGACGAAAATCCAAAGAGCGC
COBRA_end_joining_pairs.txt:
k141_9211005_L k141_7723568_Lrc
k141_7723568_Lrc k141_9211005_L
k141_13152222_L k141_6797120_Lrc
k141_6797120_Lrc k141_13152222_L
k141_6797120_L k141_13152222_Lrc
k141_13152222_Lrc k141_6797120_L
k141_7723568_L k141_9211005_Lrc
k141_9211005_Lrc k141_7723568_L
k141_22888658_L k141_10144865_R
k141_10144865_R k141_22888658_L
k141_10144865_R k141_9211005_Rrc
k141_18538400_L k141_22692969_R
k141_18538400_L k141_20768437_R
k141_22692969_R k141_18538400_L
k141_20768437_R k141_18538400_L
k141_20768437_L k141_13152222_R
k141_20768437_L k141_11312800_R
k141_13152222_R k141_20768437_L
k141_11312800_R k141_20768437_L
k141_9211005_R k141_10144865_Rrc
k141_10144865_Rrc k141_9211005_R
k141_10144865_Rrc k141_22888658_Lrc
k141_9211005_Rrc k141_10144865_R
k141_7723568_R k141_6797120_Rrc
k141_6797120_Rrc k141_7723568_R
k141_6797120_R k141_7723568_Rrc
k141_7723568_Rrc k141_6797120_R
k141_22888658_Lrc k141_10144865_Rrc
k141_11312800_Rrc k141_20768437_Lrc
k141_13152222_Rrc k141_20768437_Lrc
k141_20768437_Lrc k141_11312800_Rrc
k141_20768437_Lrc k141_13152222_Rrc
k141_20768437_Rrc k141_18538400_Lrc
k141_22692969_Rrc k141_18538400_Lrc
k141_18538400_Lrc k141_20768437_Rrc
k141_18538400_Lrc k141_22692969_Rrc
Thanks!
Not this file but the COBRA_potential_joining_paths.txt file in the intermediate.files folder. You probably need to send me the file from both runs. I will take a look tomorrow if you can send it soon (here or linkingchan@gmail.com). Thanks.
Sure. The COBRA_potential_joining_paths.txt of the first run:
k141_10144865_L []
k141_10144865_R ['k141_22888658_L']
k141_7723568_L []
k141_7723568_R []
k141_22888658_L []
k141_22888658_R []
k141_13152222_L []
k141_13152222_R []
k141_11312800_L []
k141_11312800_R []
k141_22692969_L []
k141_22692969_R []
k141_20768437_L ['k141_11312800_R']
k141_20768437_R []
k141_9211005_L []
k141_9211005_R []
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_18538400_R []
k141_6797120_L []
k141_6797120_R []
For the second run:
k141_13152222_L []
k141_13152222_R []
k141_10144865_L []
k141_10144865_R ['k141_22888658_L']
k141_9211005_L []
k141_9211005_R []
k141_22692969_L []
k141_22692969_R []
k141_7723568_L []
k141_7723568_R []
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_18538400_R []
k141_20768437_L ['k141_11312800_R']
k141_20768437_R []
k141_6797120_L []
k141_6797120_R []
k141_22888658_L []
k141_22888658_R []
k141_11312800_L []
k141_11312800_R []
It is so weird; the results should be the same, given that the potential_joining_paths are exactly the same.
Could you please send me the debug file (it could be huge, I guess; if so, please send it via email)? Thank you.
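The two COBRA_potential_joining_paths.txt listings above differ only in line order, which can be checked without comparing them by eye. The following is a minimal sketch, assuming each line has the form shown in this thread (an end name, whitespace, then a Python list literal). `parse_paths` is a hypothetical helper and not part of COBRA, and the inline strings are only a subset of the lines posted above.

```python
import ast


def parse_paths(text):
    """Parse lines like "k141_1_L ['k141_2_R']" into a dict: end -> list of ends."""
    paths = {}
    for line in text.strip().splitlines():
        # Split once on whitespace: the end name, then the list literal.
        end, literal = line.split(None, 1)
        paths[end] = ast.literal_eval(literal)
    return paths


# Subset of the first run, in the order it was posted.
run1 = """\
k141_10144865_R ['k141_22888658_L']
k141_20768437_L ['k141_11312800_R']
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_6797120_L []
"""

# The same entries in the order of the second run.
run2 = """\
k141_10144865_R ['k141_22888658_L']
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_20768437_L ['k141_11312800_R']
k141_6797120_L []
"""

# Dict comparison ignores line order but still compares each path list exactly.
same = parse_paths(run1) == parse_paths(run2)
print(same)  # True
```

Comparing dictionaries ignores the order of the lines while still treating the order inside each path list as significant.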
Hi, I checked the debug files you sent me. From what I could see, the sequences of M72_2|k141_22888658 and M72_2|k141_10144865 were validly joined until the last step. Unfortunately, for now I cannot tell what is happening. If possible, could you please perform the first run again (ensuring everything is the same)? Thank you.
Sorry for the late reply. I created a new environment, installed COBRA, and ran the same command again, and all three results are different. Next I'll try to understand what happens while the software is running.
Did you install the newest version (there are some modifications in it) and thus get different results?
No, I ran the last try three weeks ago.
That's so weird. I am still confused about what your input files for -f/--fasta and -q/--query are.
I'm going to run the Python script line by line, and I have just formatted the code using black.
As you suggested, one should first use the assembly produced with default parameters, then map reads to that assembly for coverage. Next, the raw assembly is passed as the fasta input, and longer sequences from the assembly are selected as queries.
In my practice, I renamed the sequences (adding the prefix M72_2| to identify the sample) and kept only sequences of at least 1000 bp as "raw assembly result 1", then mapped reads to this "raw assembly result 1". The query sequences are the contigs of at least 2500 bp from "raw assembly result 1".
If in your practice -f/--fasta = "raw assembly result 1", then that is incorrect: -f/--fasta should be all the assembled contigs without length filtering. -q/--query could be contigs of any length, though.
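The rule above (keep every assembled contig for -f/--fasta, apply a length cutoff only when choosing queries) can be sketched as follows. This is an illustration under stated assumptions: plain FASTA input, and the hypothetical helpers `read_fasta` and `select_queries`, which are not part of COBRA. The 2500 bp cutoff is the one mentioned in this thread.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_queries(records, min_len=2500):
    """Return the names of contigs whose length is at least min_len."""
    return [header for header, seq in records if len(seq) >= min_len]


# Toy assembly: every contig stays in the fasta input, only long ones become queries.
assembly = [
    ">k141_1 flag=1", "ACGT" * 700,  # 2800 bp
    ">k141_2 flag=1", "ACGT" * 100,  # 400 bp
    ">k141_3 flag=1", "ACGT" * 625,  # 2500 bp
]
all_contigs = list(read_fasta(assembly))
queries = select_queries(all_contigs, min_len=2500)
print(len(all_contigs), queries)  # 3 ['k141_1', 'k141_3']
```

The short contig is excluded from the query list but remains in `all_contigs`, so it is still available as a candidate for joining.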
Sure, for further analysis I'll keep all contigs for downstream tools such as COBRA.
What troubles me is that I've run the software three times with exactly the same parameters and got three different results. Is that by design? Or, if I use "all the assembled contigs without length filtering", will the outputs be consistent?
Thanks for your reply. Sincerely, hwrn.
Of course this is NOT how COBRA is designed to behave. I have to run it myself to see what is going on.
Hello, I've found that some inconsistent behavior may be caused by this code:
This code reads a mapped bam file, which can be generated by different mapping software such as bbmap.sh and bwa mem. However, the two programs behave differently when storing read names in the bam file:
Given a fq file with these sequences as an example:
_1.fq:
@DP8450004631BRL1C001R00500000953/1
CTCGACCTCAACCGGAACCGCATCCATGACTCGCTGGAGAACGCCTCCGGACCAGTGAATGTAGGGTTGTCATACGGCCATCAGCCGACCACGAGTGATGTCCGTGCTGTCAAGGAACTCGGCCATTCCGTGAACCTCGAGGTGCCTGAG
+
GG:@CHFI9HGIIIG>IGIFHIIFIIHIDHHIGIH?DGG:IIH@IIIHIH:GIHIDHHHH6@IIII2FIG+HHIHIIHHIIIIHIIIHIIHIFBI6HGHF;GIHHIHIEGGGFFGEGI7HGCGIHF8H:57D=GHIIFH8FIFHHGHE19
_2.fq.gz:
@DP8450004631BRL1C001R00500000953/2
CCAATGTCCCGTAGCGCTCGACCATCACCACGCCGTCCAGGTCAGCCAGGACCCACACGTCGTCCGCATCCACCGCCCCGAGAAGCAGAGCGTCTACCTCAGGCACCTCGAGGTTCACGGAATGGCCGAGTTCCTTGGCAGCACGGGCAT
+
GGGF5<'GCFFHHEEGF<GDF'7GDGIIC.<:GGHEGFEHE%FHGFF?DHDCEGIGGFFFEF*&CBG7F;FGC9&FFHA9EC>E'D=GEFFGGF&BH90G?GGF'=FIG?2HE@CEICF.EFF>;BHG>G6=*G?HD,AGBD>G*H7FFF
For BGI data, paired reads are identified by a suffix (/1 and /2), which is kept by bbmap.sh but discarded by bwa mem. That is, the sam file generated by bbmap.sh may look like:
FP150000508TLL1C034R03603075335/1 97 k141_17086954 14 42 2X29=1X118= k141_8064633 197 0 CGCCCATGGCCACCAGGGCGACGAAGAGCGCGAAGTAGGACGCACGTGACAAGGTCATGATAAAAGGCACGAAGGCCGTCGAGGCGATCGCGAAGAATGTCCATCTCCAATGATAGGTCGGCGCATAGAGCGCTAGCGCCATGGACAGGC FFFFEFDFFF>EEEFF.EFFFADFFFBFFFFFFFFFFEFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:3 AM:i:42
FP150000508TLL1C005R02300778018/2 145 k141_17086954 116 3 8=1X7=1X14=1X71=1X46= k141_8064633 54 0 ATCTCCAACGATAGGTGGGCGCATAGAGCGCCAGCGCCATGGACAGGCAGATAACGATGAGCAGGTAGCCGCCGAGTGTATTGGGCTCCGTGCCCCCTGCCTCAAAGGGCGCACTCACGCGCGGCAGCGTGCCGATACTGATGATCCCGT EFFFFFFEFFFFFFFFFFFFFFFFFFFFEFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF XT:A:R NM:i:4 AM:i:3
while that generated by bwa mem may look like:
DP8450004631BRL1C017R00701242685 99 M80_2|k141_8788616 156 40 116S19M15S = 156 19 ATGAAGTGTGATGATTTACTGTTCCAATAAGGAATATACTCAGGTCGCCCAATAGCGGGATCTCGCAATACCTCGCTATAGTTTGGAGGGTATTCTGGGTCTTTAGTGAAGAATAAGAAATTACCAAAATCTGATTCTATAGTATCATCA IIGGG@FHFGGHIGIGIIGIHGFGGIIHIHHGGGIIHIGIIHGFFGIGIGIIIIEFGFHGHGIHGHHGHIIDIHBGGHHIHGGHGDFGGHGIHIHIHEGG=IHIHHIGHGAEIIIHFGGGHIIIHIIIIIHGFBHHIIIGIHHGIIIIII NM:i:0 MD:Z:19 MC:Z:61S19M70S AS:i:19 XS:i:0
DP8450004631BRL1C017R00701242685 147 M80_2|k141_8788616 156 40 61S19M70S = 156 -19 CGGGATCTCGCAATACCTCGCTATAGTTTGGAGGGTATTCTGGGTCTTTAGTGAAGAATAAGAAATTACCAAAATCTGATTCTATAGTATCATCAGATTTAGAAAGATTATTAACAATATAATCAATAAAGCCCTCAGCAACTTTCCCAG HIIIIGHFIIIHGHIIIHHIIIIIIIIHIIHIIIIIHIHIIIIIFIIIIIIHIIHIIIIIIIIHIIIHGIHIIIHIIIIIHIIIIHIIIIIHIIIIIHIIIIGIHIIIIIIIIIIIIIIIIIHIHIIHIIHHIIFIHHIIGIIGGGGHII NM:i:0 MD:Z:19 MC:Z:116S19M15S AS:i:19 XS:i:0
When the suffixes are kept, https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L891 will never put the two reads of a pair into the same list, because their names differ.
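The effect can be reproduced with a few toy records. This is a simplified sketch and not COBRA's code: it assumes that mates are collected by grouping alignments on the read name, and `group_by_read_name`, the read names, and the contig names are all hypothetical.

```python
from collections import defaultdict


def group_by_read_name(records):
    """Group (read_name, contig, position) records by read name."""
    groups = defaultdict(list)
    for name, contig, pos in records:
        groups[name].append((contig, pos))
    return groups


# Names with the /1 and /2 suffixes kept, as in the bbmap.sh output above.
with_suffix = [("READ953/1", "k141_1", 14), ("READ953/2", "k141_2", 54)]
# Names with the suffixes discarded, as in the bwa mem output above.
without_suffix = [("READ953", "k141_1", 14), ("READ953", "k141_2", 54)]

sizes_with_suffix = {len(v) for v in group_by_read_name(with_suffix).values()}
sizes_without_suffix = {len(v) for v in group_by_read_name(without_suffix).values()}
print(sizes_with_suffix)     # {1}: the mates never share a group
print(sizes_without_suffix)  # {2}: both mates land in one group
```

A set of group sizes equal to `{1}` is a quick sign that no pair has been recognized at all.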
Besides, why should we care about paired reads that are both at the ends of a contig? Is it only for determining self-circular contigs? Why should we only care about the start of the read, but not the end? Is that based on experience?
Thanks!
P.S.: I'm trying to refactor cobra.py following suggestions from black and mypy. Would you mind accepting some pull requests that make the code more structured and readable? :)
Hi,
Thank you so much for your efforts.
Best, LINXING
Thanks!
line.query_name from map_file and PE name in contig_spanned_by_PE_reads.
When I noticed that {len(PE) for contig in contig_spanned_by_PE_reads for PE in contig} returns {1}, I checked the read names from the same contig and found read names that are nearly identical but have different suffixes.
pysam doesn't distinguish this.
bbmap.sh and may face the problem.
Sincerely, hwrn
Could you please share the different sam/bam files that you created?
Regarding "why should we care about paired reads that are both at the ends of a contig? Is it only for determining self-circular contigs? Why should we only care about the start of the read, but not the end? Is that based on experience?": it is not only for self-circular contigs. Given the insert length in library construction, (1) the two paired-end reads should NOT be allowed to span too great a distance, and (2) paired-end reads must span the ends of the two contigs that are going to be joined. Hope this makes sense to you.
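The two conditions in this reply can be illustrated with a small sketch. It is not COBRA's implementation: the function names, the 0-based mapping positions, and the 500 bp window standing in for the insert-length limit are all assumptions made for the example.

```python
def end_of(pos, contig_len, window):
    """Return 'L' or 'R' if pos lies within `window` bp of a contig end, else None."""
    if pos < window:
        return "L"
    if pos >= contig_len - window:
        return "R"
    return None


def linked_ends(mate1, mate2, contig_lengths, window=500):
    """Return the pair of contig ends linked by two mates, or None.

    Each mate is (contig, position). Both mates must map near a contig end,
    which keeps the distance spanned by the pair bounded by the window.
    """
    (c1, p1), (c2, p2) = mate1, mate2
    e1 = end_of(p1, contig_lengths[c1], window)
    e2 = end_of(p2, contig_lengths[c2], window)
    if e1 is None or e2 is None:
        return None  # one mate sits too far inside its contig
    if c1 == c2 and e1 == e2:
        return None  # both mates at the same end of one contig: no linkage
    return (f"{c1}_{e1}", f"{c2}_{e2}")


lengths = {"a": 5000, "b": 3000}
print(linked_ends(("a", 4900), ("b", 20), lengths))    # ('a_R', 'b_L')
print(linked_ends(("a", 2500), ("b", 20), lengths))    # None
print(linked_ends(("a", 10), ("a", 4950), lengths))    # ('a_L', 'a_R')
```

The last call shows the self-circular case: both mates map to the same contig, one near each end.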
Sure. The bam generated using bwa mem was mapped to dereplicated MAGs for another purpose, and the other bam I'll provide is the newly generated one using the whole assembly result.
However, the bam files are too large (>80G each), so I'll share them by email. Before that, could the two example sam excerpts above (from bbmap.sh and bwa mem) be used for testing directly?
Hold on. 80 GB is too large.
Do you know if there are any other sequencers that generate reads with headers like those from BGI?
I've searched my sequence data, but even the smallest dataset is 26Gb (total size of clean.1.fq.gz and clean.2.fq.gz). The demo data provided by BGI is 100Gb (💯). However, I read the code of bwa mem and found out how bwa treats paired-end reads:
aux.ks and aux.ks2: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/fastmap.c#L376-L392
bseq_read: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/fastmap.c#L64-L73C15
bseq_read, readno will be trimmed first: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/bwa.c#L79C10-L94
trim_readno just looks at the last two characters and removes them if they are a "/" followed by a digit: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/bwa.c#L54C20-L58
static inline void trim_readno(kstring_t *s)
{
    if (s->l > 2 && s->s[s->l-2] == '/' && isdigit(s->s[s->l-1]))
        s->l -= 2, s->s[s->l] = 0;
}
This code is from a commit (moved some common code to bwa.{c,h}) 12 years ago with
If so, it looks like the /1 and /2 are not a problem, am I correct?
I think that /1 and /2 are a normal way to identify paired reads and can be removed when necessary. I've added a "--trim_readno" parameter to handle this.
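For reference, the trimming rule of the bwa function quoted above can be written in Python as follows. This is only a sketch of what such an option might do; it is a port of the C snippet and not the actual code behind the "--trim_readno" parameter, which is not shown in this thread.

```python
def trim_readno(name: str) -> str:
    """Drop a trailing "/<digit>" from a read name, mirroring bwa's trim_readno."""
    # Same condition as the C code: length > 2, second-to-last character is "/",
    # last character is an ASCII digit.
    if len(name) > 2 and name[-2] == "/" and name[-1] in "0123456789":
        return name[:-2]
    return name


mate1 = trim_readno("DP8450004631BRL1C001R00500000953/1")
mate2 = trim_readno("DP8450004631BRL1C001R00500000953/2")
print(mate1 == mate2)  # True: both mates now share one name
```

Names without the suffix pass through unchanged, so applying the function to bwa mem output has no effect.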
Meanwhile, I think that line 277 is useless for detect_self_circular: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L265-L285
Consider link_pair, which collects linkage between header + "_L", header + "_R", header + "_Lrc", and header + "_Rrc". If maxk_length is correctly set to an odd number, seq[:maxk_length] will never be the same as reverse_complement(seq[:maxk_length]), because the middle base is ALWAYS different.
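The odd-length argument can be checked exhaustively for small k. The sketch below uses a hypothetical reverse_complement helper (not the one in cobra.py) and enumerates all 5-mers and all 4-mers: a sequence equal to its own reverse complement would need a middle base that is its own complement, which no base is.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    # Complement every base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


def self_reverse_complement_kmers(k: int) -> list:
    # All k-mers over ACGT that equal their own reverse complement.
    kmers = ("".join(bases) for bases in product("ACGT", repeat=k))
    return [kmer for kmer in kmers if kmer == reverse_complement(kmer)]


odd_hits = self_reverse_complement_kmers(5)
even_hits = self_reverse_complement_kmers(4)
print(len(odd_hits), len(even_hits))
```

Odd k yields no such k-mers, while even k does (for example ACGT), which is why the question about even kmer lengths matters.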
Next I'm puzzled at line 966: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L962-L969
I think that the end_part may sometimes occur in the middle of the sequence. For example, I think this is a valid self_circular contig overlapping with kmer[2n+1]:
kmer[2n+1 - end_part] + kmer[end_part] + any + kmer[end_part] + any2 + kmer[2n+1 - end_part] + kmer[end_part]
We should pay attention to those with the end_part in the middle, if they ever exist. I have no idea why this happens; it is better to ignore such cases, which is why I set it == 2.
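To make the "== 2" rule concrete, here is a minimal sketch. It assumes that the check counts how often end_part occurs in the contig and accepts only contigs where it appears exactly twice (once at each end). The function names and the overlap_len parameter are hypothetical and are not taken from cobra.py.

```python
def count_occurrences(seq: str, sub: str) -> int:
    # Count occurrences of sub in seq, allowing overlapping matches.
    count, start = 0, seq.find(sub)
    while start != -1:
        count += 1
        start = seq.find(sub, start + 1)
    return count


def looks_self_circular(seq: str, overlap_len: int) -> bool:
    # end_part is the expected overlap found at the end of the contig.
    end_part = seq[-overlap_len:]
    if not seq.startswith(end_part):
        return False
    # Exactly two copies: one at the start, one at the end, none in between.
    return count_occurrences(seq, end_part) == 2


clean = "ACGTA" + "TTTTGGGGCC" + "ACGTA"
with_middle_copy = "ACGTA" + "TTTT" + "ACGTA" + "GGGG" + "ACGTA"
print(looks_self_circular(clean, 5), looks_self_circular(with_middle_copy, 5))
```

Under this assumption a contig with an extra copy of end_part in the middle is skipped rather than reported as circular, which is the conservative behaviour described above.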
Meanwhile, I think that line 277 is useless for detect_self_circular: when considering link_pair, which collects linkage between header + "_L", header + "_R", header + "_Lrc", and header + "_Rrc", if maxk_length is correctly set to an odd number, seq[:maxk_length] will never be the same as reverse_complement(seq[:maxk_length]) -- the middle base is ALWAYS different.
You are right, but (1) will this step take long? And (2) probably someone will use even kmer numbers?
And this line should be used to exclude some other abnormal cases of false positive "self-circular"; unfortunately, I cannot recall what they are. Otherwise, I could only use if contig + '_R' in link_pair[end] for both one_path_end and two_paths_end, right?
We should pay attention to those with the end_part in the middle, if they ever exist. I have no idea why this happens; it is better to ignore such cases, which is why I set it == 2.
In my test data, there is no such case among 172617 orphan_end_query sequences. However, I think it can happen, for example, in a sequence that contains CRISPR spacers, where the circular sequence breaks near a spacer.
And this line should be used to exclude some other abnormal cases of false positive "self-circular"; unfortunately, I cannot recall what they are.
Thanks for your explanation! I am just curious about the conditions under which this line takes effect. Besides, I think assemblers rarely allow an even kmer length, to avoid a kmer being equal to its own reverse complement.
- xxxx ---- xxxx ---- xxxx (xxxx = the length of maxK for metaspades and megahit or maxK-1 for idba_ud) should not exist, in my opinion.
- It is not for even kmers but for something else that I can't recall for now.
- xxxx ---- xxxx ---- xxxx (xxxx = the length of maxK for metaspades and megahit or maxK-1 for idba_ud) should not exist, in my opinion.
- It is not for even kmers but for something else that I can't recall for now.
For your definition, I agree that the pattern xxxx ---- xxxx ---- xxxx (len(xxxx) = maxK) will never form in an assembly. What concerns me is the pattern yy xxx ---- xxx ---- yy xxx (len(xxx) = minK) when finding self_circular with a not-long-enough expected overlap. Thanks for your kind explanations!
- xxxx ---- xxxx ---- xxxx (xxxx = the length of maxK for metaspades and megahit or maxK-1 for idba_ud) should not exist, in my opinion.
- It is not for even kmers but for something else that I can't recall for now.
- For your definition, I agree that the pattern xxxx ---- xxxx ---- xxxx (len(xxxx) = maxK) will never form in an assembly. What concerns me is the pattern yy xxx ---- xxx ---- yy xxx (len(xxx) = minK) when finding self_circular with a not-long-enough expected overlap.
- OK, I've recorded this risk and will keep an eye on it.
Thanks for your kind explanations!
I agree with you on 1. That makes sense; we should modify this. Thanks.
Hi, I've found that the loop in [09/23] can be sped up significantly by changing https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L1000-L1001 to:
for contig in tqdm(
    query_set - (orphan_end_query | self_circular),
    desc="Detecting joins of contigs. ",
):
For example, in my data, len(query_set)=339757, len(orphan_end_query)=172917, len(self_circular)=181; the old version takes more than 30 min to finish, while the new version takes only 3 seconds.
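A small sketch of why the two forms select the same contigs but scale differently. It assumes the original loop tests membership against a freshly built list on every iteration (the form quoted further down in this thread). The function names and toy data are hypothetical.

```python
def select_queries_list_based(query_set, orphan_end_query, self_circular):
    selected = set()
    for contig in query_set:
        # Two new lists are built and scanned linearly for every contig.
        if contig not in list(orphan_end_query) + list(self_circular):
            selected.add(contig)
    return selected


def select_queries_set_based(query_set, orphan_end_query, self_circular):
    # One union and one difference, each relying on hash lookups.
    return query_set - (orphan_end_query | self_circular)


query_set = {f"k141_{i}" for i in range(2000)}
orphan_end_query = {f"k141_{i}" for i in range(0, 2000, 2)}
self_circular = {"k141_1", "k141_3"}

slow = select_queries_list_based(query_set, orphan_end_query, self_circular)
fast = select_queries_set_based(query_set, orphan_end_query, self_circular)
print(slow == fast, len(fast))
```

The list-based form does work proportional to len(query_set) times the size of the excluded sets, while the set-based form is roughly linear in the total number of contigs.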
Meanwhile, I noticed that an upper bound on the target contig's coverage is set when trying to append it to contig2join: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L363 . This seems to avoid adding a target contig whose coverage is too high.
I also noticed that this restriction only appears in the two_paths_end situation, for all added contigs except the first one, so I'm also curious why we don't worry about a first contig with much higher abundance joining with the contig?
Regarding [09/23], did you only modify lines 1000 and 1001? If yes, I do not think this modification is the main reason for the time reduction. And I am wondering whether the time you monitored covers exactly the whole step.
Please explain why if you disagree.
I modified
for contig in query_set:
    if contig not in list(orphan_end_query) + list(self_circular):
to
for contig in query_set - (orphan_end_query | self_circular):
and checked one sample, and found the time reduced from ~5 mins to ~4 mins. I will check more samples though.
And line 363,
if cov[contig_name(target)] >= 1.9 * cov[contig]:
is there to exclude the case of a repeat region longer than the max kmer, for example, transposase genes (xxxx below).
-------------xxxx-------------xxxx---------------xxxx------------------
------1------.......-------2-----.......-------3------........-------4----------
If region 1 is the query, the join could be 1+2, 1+3 or 1+4, which cannot be determined.
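The coverage test can be illustrated with a toy example. A repeat that occurs three times in the genome is assembled once and so collects roughly three times the coverage of its unique neighbours. The helper name, the contig names and the coverage values below are hypothetical; only the 1.9 factor comes from the line quoted above.

```python
def exceeds_coverage_bound(cov, query, target, max_ratio=1.9):
    # True when the target is covered at least max_ratio times as deeply as
    # the query, which suggests the target is a multi-copy repeat.
    return cov[target] >= max_ratio * cov[query]


# Toy coverage table: three unique regions and one repeat present 3 times.
cov = {"region_1": 10.0, "region_2": 10.5, "repeat_xxxx": 30.0}

print(exceeds_coverage_bound(cov, "region_1", "repeat_xxxx"))
print(exceeds_coverage_bound(cov, "region_1", "region_2"))
```

A target flagged this way is one where the path could continue into several different regions, so skipping it avoids an arbitrary choice between 1+2, 1+3 and 1+4.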
and checked one sample, and found the time reduced from ~5 mins to ~4 mins. I will check more samples though.
Oh, I tried a toy example on my local machine, and also cannot reproduce this problem:
from tqdm import tqdm
a = {"K141_{i}" for i in range(3000000)}
b = {"K141_{i}" for i in range(1500000)}
for contig in tqdm(a):
    if contig not in list(b):
        pass
However, I could still reproduce it using the dataset I'm testing on. It is the repeated conversion of orphan_end_query to a list() that costs a lot of time (probably a problem related to memory allocation and usage)...
Thanks for your kind explanation! I've got it!
Sorry, what problem are you talking about that you could not reproduce?
The extremely long time spent repeatedly generating a list from orphan_end_query.
>>> timeit("list(orphan_end_query)", number=1000, globals=globals())
3.173712281975895
>>> len(orphan_end_query)
172917
>>> timeit("list(orphan_end_query)", number=1000, globals={"orphan_end_query": {f'k141_{i}' for i in range(172917)}})
3.205506692000199
This means that only about 300 entries can be processed per second, and 1/(1000/3.2) * 339757 / 60 = 18 min will be wasted on it...
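The estimate can be reproduced directly from the timeit result. This is just the arithmetic above written out, with a hypothetical helper name.

```python
def estimated_overhead_minutes(timeit_seconds, timeit_calls, n_iterations):
    # Cost of one list(orphan_end_query) call, taken from the timeit run.
    seconds_per_call = timeit_seconds / timeit_calls
    # The loop pays that cost once per query contig.
    return seconds_per_call * n_iterations / 60


calls_per_second = 1000 / 3.2
minutes = estimated_overhead_minutes(3.2, 1000, 339757)
print(f"{calls_per_second:.1f} calls/s, about {minutes:.1f} min of overhead")
```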
And I've tested it on my computer:
>>> timeit("list(orphan_end_query)", number=1000, globals={"orphan_end_query": {f'k141_{i}' for i in range(172917)}})
5.158639047993347
Thanks for this great tool! However, I have a few questions on input for cobra
./intermediate_contigs/k141.contigs.fa, in which there are many small contigs < 200 bp (which will be filtered in ./final.contigs.fa). My question is, which contig file is more recommended to be used as --fasta FASTA input?
final.contigs.fa as --query QUERY file, or a filtered version with all contigs longer than a given length (e.g. 1000 or 2500 bp)? In other words, can cobra be used before virus contig annotation and MAG binning?
Regards, hwrn