linxingchen / cobra

A tool to raise the quality of viral genomes assembled from short-read metagenomes via resolving and joining of contigs fragmented during de novo assembly.
MIT License

Questions on input fasta files #15

Open Hocnonsense opened 9 months ago

Hocnonsense commented 9 months ago

Thanks for this great tool! However, I have a few questions about the input for cobra.

  1. I am currently using megahit for assembly (I learned from #3 that it may generate many chimeric contigs, but spades may run out of memory when assembling environmental samples). Among its intermediate results, megahit provides a file named ./intermediate_contigs/k141.contigs.fa, which contains many small contigs < 200 bp (these are filtered out of ./final.contigs.fa). Which contig file is recommended as the --fasta FASTA input?
  2. Is it possible to use final.contigs.fa itself as the --query QUERY file, or a filtered version containing only contigs longer than a given length (e.g. 1000 or 2500 bp)? In other words, can cobra be used before viral contig annotation and MAG binning?

Regards, hwrn

linxingchen commented 9 months ago

Hi hwrn,

Thank you for your interest in COBRA.

Regarding your questions, please see my answers below:

  1. You should use final.contigs.fa as the input file for the --fasta/-f flag of COBRA.
  2. Do not use final.contigs.fa as queries, as it will take too long to finish. It is OK to use contigs above a minimum length (e.g., 2500 bp) as queries, but as you can see from the reviewer comments, one of the reviewers thought there may be issues that we could not predict. It would be good to extend everything using COBRA before binning, but you would be doing so at your own risk.

I hope this helps. Let me know if you want to discuss more.

Best, LINXING

Hocnonsense commented 9 months ago

Thanks! I've been trying cobra on my data these days. However, it seems to have stopped at step [11/23] for nearly 20 hours. Is that OK?

2. PROCESSING STEPS
[01/23] [2024/02/09 20:22:34] Reading contigs and getting the contig end sequences. A total of 1980767 contigs were imported.
[02/23] [2024/02/09 20:24:12] Getting shared contig ends.
[03/23] [2024/02/09 20:24:30] Writing contig end joining pairs.
[04/23] [2024/02/09 20:24:30] Getting contig coverage information.                                              
[05/23] [2024/02/09 20:24:33] Getting query contig list. A total of 339757 query contigs were imported.
[06/23] [2024/02/09 20:24:42] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/23] [2024/02/09 20:46:40] Parsing the linkage information.
[08/23] [2024/02/09 20:46:47] Detecting self_circular contigs.
[09/23] [2024/02/09 21:06:56] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[10/23] [2024/02/09 23:19:30] Saving potential joining paths.
[11/23] [2024/02/09 23:19:33] Checking for invalid joining: sharing queries.

Next, I found that same_path is finished. However, contig_shared_by_paths has not been exported yet, and there is a for loop here: https://github.com/linxingchen/cobra/blob/7eacae7c7aea049cd1d5ad5cbab88f961aeb11c5/cobra.py#L1194-L1213

I'm curious why there are so many pass statements. Is there any condition that we should care about? And could this code be improved like this?

    # requires `import tqdm` at the top of cobra.py
    for contig in tqdm.tqdm(set(all)):
        # only contigs used by more than one path, excluding failed joins
        if all.count(contig) > 1 and contig not in failed_join_list:
            for contig_1 in contig2assembly:
                if contig_1 not in redundant:
                    if contig in contig2assembly[contig_1]:
                        contig_shared_by_paths.add(contig_1)
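
A possible further speed-up, sketched under assumptions: `all.count(contig)` rescans the whole list for every contig, so counting everything once with `collections.Counter` avoids that repeated work. The variable names follow the snippet above; the function name `paths_sharing_contigs` and the toy data are hypothetical, `all` is assumed to be a flat list of the contigs used across all paths, and this has not been checked against the original loop in cobra.py.

```python
from collections import Counter


def paths_sharing_contigs(all_contigs, contig2assembly, redundant, failed_join_list):
    """Return the keys of contig2assembly whose path contains a contig used more than once."""
    # Count every contig once instead of calling list.count() per contig.
    counts = Counter(all_contigs)
    shared = {c for c, n in counts.items() if n > 1 and c not in failed_join_list}

    contig_shared_by_paths = set()
    for contig_1, path in contig2assembly.items():
        if contig_1 in redundant:
            continue
        # isdisjoint() stops at the first shared contig found in this path.
        if not shared.isdisjoint(path):
            contig_shared_by_paths.add(contig_1)
    return contig_shared_by_paths


# Hypothetical toy data: three query paths, contig "b" is used by two of them.
contig2assembly = {"q1": ["q1", "a", "b"], "q2": ["q2", "b", "c"], "q3": ["q3", "d"]}
all_contigs = [c for path in contig2assembly.values() for c in path]
result = paths_sharing_contigs(all_contigs, contig2assembly, redundant=set(), failed_join_list=set())
print(sorted(result))
```

Because the result is a set built from membership tests only, it should not depend on the order in which contigs are visited.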

p.s. 1: to support biopython>=1.82, https://github.com/linxingchen/cobra/blob/7eacae7c7aea049cd1d5ad5cbab88f961aeb11c5/cobra.py#L13 can be changed to from Bio.SeqUtils import gc_fraction as GC

p.s. 2: Wishing you a happy Spring Festival in the Year of the Dragon, good health in the new year, success in your work, and all the best!

linxingchen commented 9 months ago

Hi,

Sorry to hear that the sharing-queries step took so long. One reason is that you have a lot of joins given the huge number of queries (339757). Did you use everything >= 1000 bp?

Your suggestion for those lines looks good. Could you please try it on your end to see how fast it could be?

Regarding the GC function, could I just add one more line, from Bio.SeqUtils import gc_fraction as GC, without checking which version the user has installed? I am not sure about this. If I just change that line, I think users on 1.81 would get an error.

Thank you.

Happy New Year.

Best, LINXING

Hocnonsense commented 9 months ago

Thanks!

I filtered 1980767 contigs >= 1000 bp as the fasta assembly and used 339757 contigs >= 2500 bp as queries. It's really a big project. The program has now finished, and the log is here:

2. PROCESSING STEPS
[01/23] [2024/02/09 20:22:34] Reading contigs and getting the contig end sequences. A total of 1980767 contigs were imported.
[02/23] [2024/02/09 20:24:12] Getting shared contig ends.
[03/23] [2024/02/09 20:24:30] Writing contig end joining pairs.
[04/23] [2024/02/09 20:24:30] Getting contig coverage information.
[05/23] [2024/02/09 20:24:33] Getting query contig list. A total of 339757 query contigs were imported.
[06/23] [2024/02/09 20:24:42] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/23] [2024/02/09 20:46:40] Parsing the linkage information.
[08/23] [2024/02/09 20:46:47] Detecting self_circular contigs.
[09/23] [2024/02/09 21:06:56] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[10/23] [2024/02/09 23:19:30] Saving potential joining paths.
[11/23] [2024/02/09 23:19:33] Checking for invalid joining: sharing queries.
[12/23] [2024/02/11 03:45:03] Getting initial joining status of each query contig.
[13/23] [2024/02/11 03:59:01] Getting final joining status of each query contig.
[14/23] [2024/02/11 03:59:14] Getting the joining order of contigs.
[15/23] [2024/02/11 04:00:07] Getting retrieved contigs.
[16/23] [2024/02/11 04:00:15] Saving joined seqeuences.
[17/23] [2024/02/11 04:00:21] Checking for invalid joining using BLASTn: close strains.
[18/23] [2024/02/11 04:02:48] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[19/23] [2024/02/11 04:02:51] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[20/23] [2024/02/11 04:02:51] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
[21/23] [2024/02/11 04:08:00] Saving joining status of all query contigs.
[22/23] [2024/02/11 04:08:31] Saving self_circular contigs.
[23/23] [2024/02/11 04:08:31] Saving the new fasta file.

3. RESULTS SUMMARY
# Total queries: 339757
# Category i   - Self_circular: 181
# Category ii  - Extended_circular: 0 (Unique: 0)
# Category ii  - Extended_partial: 17524 (Unique: 11521)
# Category ii  - Extended_failed (due to COBRA rules): 76038
# Category iii - Orphan end: 246014
# Check "COBRA_joining_status.txt" for joining status of each query.
# Check "COBRA_joining_summary.txt" for joining details of "Extended_circular" and "Extended_partial" queries.
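
As a rough aid for reading logs like the one above, the step timestamps can be turned into per-step durations. This is a minimal sketch that assumes every step line has the `[NN/23] [YYYY/MM/DD HH:MM:SS]` prefix shown in the log; the function name `step_durations` is hypothetical and not part of COBRA.

```python
import re
from datetime import datetime

# Matches the "[11/23] [2024/02/09 23:19:33]" prefix of a step line.
LOG_LINE = re.compile(r"\[(\d+)/\d+\] \[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]")


def step_durations(log_text):
    """Map each step number to the time elapsed until the next step started."""
    starts = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            started = datetime.strptime(m.group(2), "%Y/%m/%d %H:%M:%S")
            starts.append((int(m.group(1)), started))
    # The last step has no successor, so it gets no duration.
    return {step: nxt - t for (step, t), (_, nxt) in zip(starts, starts[1:])}


# Three lines copied from the log above.
log = """
[10/23] [2024/02/09 23:19:30] Saving potential joining paths.
[11/23] [2024/02/09 23:19:33] Checking for invalid joining: sharing queries.
[12/23] [2024/02/11 03:45:03] Getting initial joining status of each query contig.
"""
for step, took in step_durations(log).items():
    print(step, took)
```

For these lines, step 11 spans from 2024/02/09 23:19:33 to 2024/02/11 03:45:03, i.e. a little over 28 hours.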

I've submitted the job again, and it may take three days to finish. If COBRA could resume from the last interrupted run, it would be very helpful!

As for biopython, gc_fraction was introduced in 1.80 to replace GC, so only users with biopython<=1.79 would get an error.
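
One way to avoid checking the installed version explicitly is a guarded import. The following is only a sketch, not COBRA's code: the helper name `gc_percent` is hypothetical, and the final pure-Python branch is an assumption added solely so the example runs without Biopython installed. Note that `gc_fraction` returns a fraction between 0 and 1, whereas the legacy `GC` returned a percentage, so a plain `import gc_fraction as GC` changes the scale; which scale the downstream code expects would need to be checked.

```python
try:
    # Biopython >= 1.80 provides gc_fraction, which returns a value in [0, 1].
    from Bio.SeqUtils import gc_fraction

    def gc_percent(seq):
        """GC content as a percentage, computed via gc_fraction."""
        return 100.0 * gc_fraction(seq)
except ImportError:
    try:
        # Older Biopython: the legacy GC function already returns a percentage.
        from Bio.SeqUtils import GC as gc_percent
    except ImportError:
        # Assumption for this sketch only: plain-Python fallback so the
        # example still runs when Biopython is not installed.
        def gc_percent(seq):
            """GC content as a percentage, counting only G and C."""
            seq = seq.upper()
            if not seq:
                return 0.0
            return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_percent("ATGC"))
```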

Regards, hwrn

linxingchen commented 9 months ago

Hold on. For the -f flag input, you should use all contigs without length filtering. That's why you did not get any Extended_circular sequences.

Hocnonsense commented 9 months ago

Megahit itself has a parameter, --min-contig-len, which controls the minimum length of output contigs (default 200). Other assemblers have similar parameters to filter out shorter contigs. On the other hand, you also indicated that the intermediate contigs, which include much shorter contigs, should not be used. In your opinion, which threshold is appropriate for the assembly before COBRA? Thanks!

Meanwhile, if I have already mapped reads to the unfiltered final.contigs.fa, can I use this coverage directly for binning, where only contigs >=2500bp will be used? The alternative is to map reads to the subset of contigs >=2500bp and generate another abundance file. Which do you prefer? Thanks!

linxingchen commented 9 months ago

Hi,

I do not suggest changing the default value of --min-contig-len of MEGAHIT or the similar flag of other assemblers.

Technically you should use the bam/sam file mapped to all unfiltered contigs to get the coverage file for binning.

Please keep in mind that for COBRA, -f/--fasta = all the contigs from an assembly, -q/--query = the contigs you want COBRA to extend.
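
The split described above (every contig for -f, a length-filtered subset for -q) can be sketched as follows. This is an illustration only: the helper names `read_fasta` and `select_queries` are hypothetical, the sequences are made up, and it assumes that a FASTA subset is an acceptable -q input, as the thread implies.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_queries(records, min_len=2500):
    """Keep records at least min_len long; the full record list stays untouched for -f."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Toy assembly: one 2800 bp contig and one 200 bp contig (made-up sequences).
assembly = [">long_contig", "ACGT" * 700, ">short_contig", "ACGT" * 50]
records = list(read_fasta(assembly))             # everything: candidate -f input
queries = select_queries(records, min_len=2500)  # subset: candidate -q input
print(len(records), [h for h, _ in queries])
```

The short contig is excluded from the queries but remains in the full set, which matches the advice that short contigs are still needed to connect longer ones.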

Hocnonsense commented 9 months ago

Thanks for your kind advice! It helps me a lot!

After the rerun, it saved a lot of time (the loop I marked above took only 18 minutes):

2. PROCESSING STEPS
[01/23] [2024/02/11 13:05:52] Reading contigs and getting the contig end sequences. A total of 1980767 contigs were imported.                                         
[02/23] [2024/02/11 13:07:30] Getting shared contig ends.
[03/23] [2024/02/11 13:07:48] Writing contig end joining pairs.
[04/23] [2024/02/11 13:07:48] Getting contig coverage information.
[05/23] [2024/02/11 13:07:50] Getting query contig list. A total of 339757 query contigs were imported.                                                               
[06/23] [2024/02/11 13:07:59] Getting contig linkage based on sam/bam. Be patient, this may take long.                                                                
[07/23] [2024/02/11 13:30:10] Parsing the linkage information.
[08/23] [2024/02/11 13:30:17] Detecting self_circular contigs.
[09/23] [2024/02/11 13:50:36] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.                                                 
[10/23] [2024/02/11 16:02:44] Saving potential joining paths.
[11/23] [2024/02/11 16:02:47] Checking for invalid joining: sharing queries.
[12/23] [2024/02/11 20:52:32] Getting initial joining status of each query contig.
[13/23] [2024/02/11 21:06:13] Getting final joining status of each query contig.
[14/23] [2024/02/11 21:06:26] Getting the joining order of contigs.
[15/23] [2024/02/11 21:07:18] Getting retrieved contigs.
[16/23] [2024/02/11 21:07:30] Saving joined seqeuences.
[17/23] [2024/02/11 21:07:46] Checking for invalid joining using BLASTn: close strains.                                                                               
[18/23] [2024/02/11 21:10:13] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[19/23] [2024/02/11 21:10:16] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[20/23] [2024/02/11 21:10:16] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.

linxingchen commented 9 months ago

Hi, thanks for the update.

I am confused. (1) Did you use the several lines you wrote to replace those in the original script? (2) Which step took only 18 minutes? I do not see that in the log; please clarify. (3) The number of contigs in step [01/23] remains the same; did you still use those >= 1000 bp for the -f/--fasta input?

It would be great if you could let me know what you have done.

Hocnonsense commented 9 months ago

Thanks for your quick reply!

  1. Yes, I edited lines 1198 to 1209 in cobra.py to:

    for contig in tqdm.tqdm(set(all)):
        if all.count(contig) > 1 and contig not in failed_join_list:
            for contig_1 in contig2assembly:
                if contig_1 not in redundant:
                    if contig in contig2assembly[contig_1]:
                        contig_shared_by_paths.add(contig_1)

    Of course, the tqdm module is imported first.

  2. tqdm reported the time spent in the loop in (1).

  3. Yes, to compare with the previous results, I used the same input. Next time I will try the original megahit output. Will shorter contigs improve the precision of cobra results (for example, by indicating that a merge of two long contigs is unreliable)?

linxingchen commented 9 months ago

Great.

For 1 and 2, could you please compare and let me know whether the results are the same before and after you edited the lines? If it works, I will update these lines in the next release.

For 3, yes, COBRA needs those very short contigs to connect long contigs and make them longer. You can check Figure 2f in the paper; you will find that the short ones are very important.

Hocnonsense commented 9 months ago

I've checked the results. Unfortunately, the two results are not the same. I think this is caused by Python's hashing, which can iterate keys (contig IDs) in a different order in different runs.
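
Some background on the hashing hypothesis, with a small sketch: in CPython, string hashes are randomized per process unless PYTHONHASHSEED is fixed, so iterating over a set of contig IDs (for example `set(all)`) may visit them in a different order on each run, whereas dicts iterate in insertion order. Whether this is the actual cause of the difference reported here is not established. The two helper names below are hypothetical; they only show ways to deduplicate with a reproducible order.

```python
def unique_in_first_seen_order(items):
    """Deduplicate while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


def unique_sorted(items):
    """Deduplicate and sort, so the order depends neither on input order nor on hashing."""
    return sorted(set(items))


contigs = ["k141_3", "k141_1", "k141_3", "k141_2"]
print(unique_in_first_seen_order(contigs))
print(unique_sorted(contigs))
```

Iterating over either result instead of a bare set gives the same visiting order on every run, which makes before/after comparisons easier to interpret.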

I've checked an example:

related contigs:

>M72_2|k141_22888658
GGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCT
CGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
>M72_2|k141_10144865
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTT
ACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAAT

In the first run, these two sequences are in COBRA_category_ii-c_extended_failed.fasta.summary.txt, i.e. they are Extended_failed (category ii-c).

In the second run, these two sequences are in COBRA_category_ii-b_extended_partial_unique_joining_details.txt:

Final_Seq_ID    Joined_Len  Status  Joined_Seq_ID   Direction   Joined_Seq_Len  Start   End Joined_Seq_Cov  Joined_Seq_GC   Joined_reason
M72_2|k141_10144865_extended_partial    5623    Partial M72_2|k141_10144865 forward 2880    1   2880    36.744  0.452   query
M72_2|k141_10144865_extended_partial    5623    Partial M72_2|k141_22888658 forward 2884    2740    5623    23.535  0.467   the_better_one

and the joined sequence is:

>M72_2|k141_10144865_extended_partial
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTT
ACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCT
CTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCTCGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG

------ edited and added below ------

However, when I checked intermediate.files/COBRA_end_joining_pairs.txt, I found that it is the same in both runs (the contig IDs are k141_22888658 and k141_10144865):

M72_2|k141_10144865_R   M72_2|k141_22888658_L
M72_2|k141_10144865_R   M72_2|k141_9211005_Rrc
M72_2|k141_10144865_Rrc M72_2|k141_22888658_Lrc
M72_2|k141_10144865_Rrc M72_2|k141_9211005_R
M72_2|k141_22888658_L   M72_2|k141_10144865_R
M72_2|k141_22888658_Lrc M72_2|k141_10144865_Rrc
M72_2|k141_9211005_R    M72_2|k141_10144865_Rrc
M72_2|k141_9211005_Rrc  M72_2|k141_10144865_R
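
To see where these pairs are ambiguous, one can group the partners of each contig end and list the ends with more than one partner. This is a sketch under assumptions: the file is taken to have two whitespace-separated columns, the `_L`/`_R` suffixes are read as left/right contig ends with `rc` marking the reverse complement, and `partners_by_end` is a hypothetical helper rather than a COBRA function.

```python
from collections import defaultdict

# The eight lines listed above, copied verbatim.
pairs_text = """\
M72_2|k141_10144865_R   M72_2|k141_22888658_L
M72_2|k141_10144865_R   M72_2|k141_9211005_Rrc
M72_2|k141_10144865_Rrc M72_2|k141_22888658_Lrc
M72_2|k141_10144865_Rrc M72_2|k141_9211005_R
M72_2|k141_22888658_L   M72_2|k141_10144865_R
M72_2|k141_22888658_Lrc M72_2|k141_10144865_Rrc
M72_2|k141_9211005_R    M72_2|k141_10144865_Rrc
M72_2|k141_9211005_Rrc  M72_2|k141_10144865_R
"""


def partners_by_end(text):
    """Map each contig end to the set of ends it is paired with."""
    partners = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            partners[fields[0]].add(fields[1])
    return partners


partners = partners_by_end(pairs_text)
# Ends with more than one partner are branching points.
branching = {end: sorted(p) for end, p in partners.items() if len(p) > 1}
for end, p in sorted(branching.items()):
    print(end, "->", p)
```

In these lines, the right end of k141_10144865 pairs with both k141_22888658 and k141_9211005, so the path through this contig branches at that end.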

The related sequence, k141_9211005, is connected to a series of sequences:

M72_2|k141_10144865_R   M72_2|k141_9211005_Rrc
M72_2|k141_10144865_Rrc M72_2|k141_9211005_R
M72_2|k141_11312800_R   M72_2|k141_20768437_L
M72_2|k141_11312800_Rrc M72_2|k141_20768437_Lrc
M72_2|k141_13152222_L   M72_2|k141_6797120_Lrc
M72_2|k141_13152222_Lrc M72_2|k141_6797120_L
M72_2|k141_13152222_R   M72_2|k141_20768437_L
M72_2|k141_13152222_Rrc M72_2|k141_20768437_Lrc
M72_2|k141_18538400_L   M72_2|k141_20768437_R
M72_2|k141_18538400_Lrc M72_2|k141_20768437_Rrc
M72_2|k141_20768437_L   M72_2|k141_11312800_R
M72_2|k141_20768437_L   M72_2|k141_13152222_R
M72_2|k141_20768437_Lrc M72_2|k141_11312800_Rrc
M72_2|k141_20768437_Lrc M72_2|k141_13152222_Rrc
M72_2|k141_20768437_R   M72_2|k141_18538400_L
M72_2|k141_20768437_Rrc M72_2|k141_18538400_Lrc
M72_2|k141_6797120_L    M72_2|k141_13152222_Lrc
M72_2|k141_6797120_Lrc  M72_2|k141_13152222_L
M72_2|k141_6797120_R    M72_2|k141_7723568_Rrc
M72_2|k141_6797120_Rrc  M72_2|k141_7723568_R
M72_2|k141_7723568_L    M72_2|k141_9211005_Lrc
M72_2|k141_7723568_Lrc  M72_2|k141_9211005_L
M72_2|k141_7723568_R    M72_2|k141_6797120_Rrc
M72_2|k141_7723568_Rrc  M72_2|k141_6797120_R
M72_2|k141_9211005_L    M72_2|k141_7723568_Lrc
M72_2|k141_9211005_Lrc  M72_2|k141_7723568_L
M72_2|k141_9211005_R    M72_2|k141_10144865_Rrc
M72_2|k141_9211005_Rrc  M72_2|k141_10144865_R

In the first run, contig k141_9211005 is in COBRA_category_ii-c_extended_failed.fasta.summary.txt. In the second run, contig k141_9211005 is not in COBRA_category_ii-b_extended_partial_unique_joining_details.txt.

linxingchen commented 9 months ago

Oops. Can you share the potential joins file with me? I can take a look to see which one is correct.

Best, LinXing

CCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTTACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCG
AGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCTCGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGA
CGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
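For readers checking the coordinates in the joining-details table quoted above, here is a minimal sketch of the arithmetic, not COBRA's own code. It assumes the Start and End columns are 1-based and inclusive; the function names are hypothetical, and the two rows are copied from the table. The overlap it computes equals the 141 in the `k141` contig names, which would fit an end overlap of the largest k-mer size, though that reading is an assumption here.

```python
# Rows copied from the joining-details table quoted above:
# (Joined_Seq_ID, Joined_Seq_Len, Start, End)
rows = [
    ("M72_2|k141_10144865", 2880, 1, 2880),
    ("M72_2|k141_22888658", 2884, 2740, 5623),
]


def overlap_between(prev_row, next_row):
    """Bases shared by two consecutive contigs (assumes 1-based inclusive coordinates)."""
    return prev_row[3] - next_row[2] + 1


def joined_length(table_rows):
    """Sum of contig lengths minus the bases counted twice in each overlap."""
    total = sum(row[1] for row in table_rows)
    shared = sum(overlap_between(a, b) for a, b in zip(table_rows, table_rows[1:]))
    return total - shared


# Each row's span should match its reported length.
for seq_id, length, start, end in rows:
    assert end - start + 1 == length, seq_id

overlap = overlap_between(rows[0], rows[1])
total_len = joined_length(rows)
print(overlap, total_len)
```

The computed joined length can be compared with the Joined_Len column (5623) as a quick consistency check on a joining-details file.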

linxingchen commented 9 months ago

Let me know when you have time to share the file. :)

Hocnonsense commented 9 months ago

Sure! Sorry for the late reply.

I've selected all the related sequences (there are only 10) and here they are:

input fasta:

>k141_13152222
TTTTTTATTTTTCTCGTCATGCGACTAGCGCCCGATGGCACCCGTTGGTGCTGAAAAGTTGCTGAGATTGCTCGGGAAACTTCGTAAAAGAAAGTTTTTTTTGATAAACAAGGGTATCATCGTCTTCGGACTGAAATGCAGCCCACCCTTTTCCTTGCCGCGCACTACCCATTTTTTTGTAAAACTTTCCCGGAAAAAACTTCCACTTTTTCCCTTTTTCAGTGCTCTGCAGTTCTTTCTCAATTTCTTCTTCAGTGAGGTGCTCCGTGGTTTTTCCTGTTATTTTGTCAAGTTGTTTGCGATATGCGTCTTCTTCAAGGAGAACAAAATATGCAAAAACTTCAGCTTCGTTCGTCAGATAGGCCGAAGTGCACACACATAAGCAGCAGGAATTTCCTAAGCGAGTGCACAGCGAACTAACGACTGAAATCAGGGCGTGGACGGAAGTAGGAGGAGCATGAAAAAAAAAATATGCTTTTGTCGATGAAACAAATATTAAAGATGGAGCTTCAGACCCAGCGCGAGCCCATTCATCCGCATGCAACAACCGAGCAACTAAATTCTCCCACTTTTCCACAGAACCTGGTGGCTCTACGGAAGCTTCGACGATAATCGAAATGTTCTTCTTCAAAATGGGGTTAGCCATTTTTTTGGCTTCCTCGATTTGTTCCTTGAGCCTACTGTAATAAGAAAGATTCATTCCTTGGGCAGGAAACTTTTCCTTTTAACTCTTTTAACTCTTTTATCGAGAGCGGACAAATTCAAGGAGAGTTAAACAGAAAAAGGATGGCTTCTAGGAATGAACCGACGCGACTGTATCTTCTACGGCGAAGTTGTCAAAGGCTACATAATCAAAACACTTATTGACGTCTTAGTGGGAAGTTTTAACCGGACTTGCTTTACGGTCACCAAGGATGGAGTTTTTTTACGAGAATGCGATAAAAATAGGAGCATTTTGTTTAATATAGAACTCTACAGAGAAAAGTTCAAAAAGTACAAGTGCGATGCGGACATCCATTTCAGTGCAAACGTAAAGCATATCCAGAGACTGATTCGAAATTTAAAAAAAAAAGATTCTCTGATTCTCGGTATTCGCCGCTCTTCGCCCGAGATGCTCTGTATAATGATCTGTCCGGCGCGAAAACCGGACAATACTAACTTCCGAATGGAAACGGCGGACATTCGAATCCAACTAGAGTCCCAACCAAATTCCGTCGTGATCCCCGATCCTTCGGTTTATTCTTACCCTTATGTCATTGACGCAGGAGAATTCCAGAAAGTTAAACGCATAGCCAGCGTAGCCAAAACGATCAGAGTTATCATTCGCGGCGACAATTATTTAGGCTTTTTGTGTGACAAAGAGATTTATTCTACTGCGTTACACTTTGGAGATCCCCAAGCGACCGAAATCCCGAAGATCTCTCGGTCACCCGAGGATCCGGCCGACCCATACGATTCGGTAGAGTCTGGGGAAGAAGAAGAAGATCTAAAATCTGGTACGGGGAAAGAAATGCGAGAGTACACTGCTGATTTCCATTCTTCTTTGTTCAACCATTTAGTCAAACTACCTGGGCTCTGTACCCAAATGCAGTTCTATGCCCCTATTGAAGAACTTTGGCCTTTACGTATTCGAATGGAAGCGGGATTACTGGGCAATATCGAAGTGTACATCAAAGACGTCCGAACTCTGGAGTACGACGAAATTGAGCAGTAACTATGGTTGCGACCAATTAATTCCTTTTTGTCTCAATTTTTGCTTTATGTAAGGCTCCATTGCTTCGGGAGGAATGAAATGTGGTATTCGAATCAAAACTATTCCTTGAGCTTGACATTTTAGGTCTTTCCACGTGTCTCGTTTTTGTTGGTAAACGAACTCCTTCGGTCCCTTTTTGTGAAAGTAAGCCCTGAATTTTGAGTGCTGAATCCCGTCGTACTCAAACGCAAGTCCTCTTCCGAACGGAGTTTTTAAATCTGCACAAAACCCGTCAAGTTCGAGTCTT
TTTCCGGTGACTGGGTTTACTAAGAACTGCGGACGCTCGTTAGGAAAAGATTGCCGGAAGAGACGCTCGAATATAGCTCGACACTTTTCTTCGCTTTTGTTCGTACGGGGTTGCTTTGGTACTCCTTTTACCGAGGGTGACCCATCTTTTTTCGCAGCATTCGTGGCTTTGTCTTTTTGAACTGCTTCTCGTACCCCGAGACCCGAAATACCTTTGAAAGGGTACTTTTGTTTGGGAGCTTTCCAAGCGTAGCCTAGTGCTATTGCGACTATAACTACGACGAACACCGAGTGCACGCTAACCCATCGACTTTGGAACAAGTGGCGACAAGTACTCCACATTTATCACTCTAGGAATTTTCCTTATCCCTAATCTCTTTTTTGAGTAGCTTCGAATTTTTAAAGAAAATTGTAGATGAAAGTTCTATTATACTTTCATTTACTAAAAACATAATTCCTAAATGGAATAGAGATTAGAGTCTCTAGAATTAAGCTCAGTAATGTCGTGCTCGGTGGACGCTCGGGAACATACCCCTTTTGTTATATCTCGATTGCAACGTACTTCCCCAACGTTATTTCTAGTGACCCTCGACGGCACACCGCAAGGCTTTGCGGAAACCGAAGAAACTGCCCGTTTCTACGTTAGAGCGCTCGCAGATGATTTAAAATTTCGACTCGGAGTGCTTCACCCGGCGAACGAGTTTCAGATCGATCTCTCCCTACCCACGGTCAGCGTTTTAGAAAATCAAAAGGGTTTAATTTTCGATTCGGGGTTTCAGGTTATTCACCGAATTGCTTTCTTCTCCATTTCTCTGTTGGACGTGAATACCTTGAAGGTTTAGGGTCGCACAAATTTCCAAAAAGTTGTTAATTATGAAAATGGTTGACTGCACACCCCAAATTTTGTTAGGCCGGTTTTCTGTTGATTAAAGAAATCGTCAGTTCCTCGCTAAGGCGTTGGGGTGTCAAAGCACTCAAAAATAAAGACGAATTTTTTCGAAGAGCGAACTCCCGTTTTTAATTTTGTACAAATCAGGTTAAAAGAACGGTCCGCACTCATAGGAAGAATATTTTTGTGCATGAAACTTGAGGTGTGGGTCACGATTGACGTTCGGGGTAAAAAAGTCGGGTTCAGCTCTCTCGGGCGGTTCCAAGACTCGAATGGGATTCGCAAAAGCGTTGCGCCACAGGAAGACGGGTATTGCACAGTTATGGTGGATTCTCAAACCCACTCCTTCCACGACTTGGTGTGCACCGCTTTTCACGGGGTGAAATCCTCCCCTGACCTTGAGGCGCACCACATCGACCACGACACGACCAACAATCGACCCGACAATTTATGTTGGGTGACGCACCAAAAAAATATGCAAGAAAGTTACCGCACCCAAACGCGAAAATCGAGTGGTCCGCAACGAAGCCGACAAATCCTGGGACGGAAGCACAAGTCCACGAAGGAATGGGTCCCGTACGCGAGCATGAAGGCGGCAGCAGAGGAGCTCGGGTTAACCGTCGGACCCATCAGCGCCGTCGCAAGGGGAAAACAGCGTCAAACCGGCGGCTACGAGTTCAAATTGGCACCGTCGCCGGATTTGCCGGGGGAGATTTGGAAATTGCTCACGGTGAACACCAAGAAAGTTCAAGTGAGTTCCTTGGGGAGATTCACGGACTCCCGAGGATTGAAAAAGTCGCCCGTGCCTAGCCGCTCCGGGTATTGCCGCGTCAAGATCAACCGGAAAACGTACTACGTCCACCGACTGGTGTGTGAGGCGTTCTGGGGCCCCTCCCTTGGGTTGGAGGTCAATCACAAAGACCTAAACAAATCCAATAATCATTATATGAACTTGGAATGGGTGACGAGACGTGACAACACCTTACACAGTTACAGTACCAACAAAAACCGCCGTTCGAGTGCCGCCAAACAGAGCAAGCCGGTGTACGGCCGCAAGCACCAAACCAACGACGAGTGGGTGGAGTACCCGAGCATGAGCAACGCAGCGGGAA
AGCTGAATCTACACTCAGGCGCAATCTCAGCAGTCACCAAAGGAAAACAGCACCAAACAGGAGGCTACGTATTCAAGCTGAAACCGCCGGAGGAAATATCGACGCTAAGTCTACCATGTTCTTTTAATATAAAAAACTCCACCCACTCGTCGTTGGTTTCGTGCTCCTCCAATAAACGTTAAAACAAAATGATTGCACCTAACAGACACGC
>k141_6797120
CTGCATTTCAGTCCGAAGACGATGATACCCTTGTTTATCAAAAAAAACTTTCTTTTACGAAGTTTCCCGAGCAATCTCAGCAACTTTTCAGCACCAACGGGTGCCATCGGGCGCTAGTCGCATGACGAGAAAAATAAAAAAATTTCAAAATCAAAATAATACAATACAACGGGAAAGAGTTAAACAATTTTGAATTTTCAAAATAAAAAAAAAAGACCCGAGCTTACACTCTCCGTTGAATAGGTTGTTCGGCAGTGGCCGGCATGCATTTAATTCCCTTGTTTATCAAAAAACTCGCATCCGGGGTATCTATTGCTTCAATAGGTTATGATTTTTTGTTTGAAATAAAAAGCAAAAAGTCAAAAACAAATAACAAGAATTAAACACGAAGAGCGTAAACAAAAACCTGAATTTGTCTTTATCCTTTCATCGCGCTGCGCGCGCCGCCGAAGGCGGTGCAAAGCACGTTACCCTGCGGGCCTCCGCTACGCGGAGTGAACGCTGGAAAGGAAAGAAAAGTCAAATCCCGTGCTAAAGTACGGGAAGGCACGGATGAGCGGTCAAAATATTGAAGATTACATTAATTGTCCGGTCACCCAATACGGCGAGGTTTTTTATTTTATTTTCCTCTAAAAATGGAAGACGAAAAGCGCGATCCCAAACGAAGTATGGTAGACGAAAAGAAAGAAGAATTCTTGTACACGATTAAGGGGATCGCTCTCCCGAACACCCCCGTGTTTTACTCAACCCTTACCGACCCACGCACCGACCCACACACTCGTTACACTATTTCGGACCAAAATGGTATGTGGGAAATAGTCGGAAAGGTGCAGGACATCCCCACCCCTCCTCACTGGTTTTTTTGGACATCTGAGAATGAGAAGCAGCGAAAAGTGGATAACGAAAATTTCTATGCTTTATCCTTCAACGGCGACGAGATCGATGATCCCCTCAAGTTCCGTTCGAGATTTGAGAACCGTGAGCTCTCCATTACTAAATTCTCCATGGACAGTCCTATTTTTGGAAGCTCCTCAAAGCCTGGAATTCAAACCGAAGAAGACGATTTTCAGGGATTTGGGTTGCGACGCCACCAGTTTAAGTTCCCCGGCGAACTGGGGATGACGGTCGCGAGCTACCTTGAGCCCATCTTATGCGGGTGGTGTGCTAGGAAGAAGGTCAAGGACAAGTTAAGTAAGTTGTGGGAACAGGCTTGGAATGAGTTCGGCACACAGTGGCGGATCAGCGGATCGATGCCTCCTTGGTCTAAGGGTGAATGGACCCTTCAGGGAGGCGAGCAGAAAGGGATGTACGAGTACCACGGTTACGGAGATGACCAGGCCAACGAGTATTGGTTGAGTCCGATAGTCGATCGCCCGTGGATAGAGAGCAACGTTCCTCACCCTCTAGGCTTTACTGTGAATGTCCCAGCAATCTTGCAATATCTCAAGGAATGGTATACTGACCAGTTGCACGATGACAGTTACGTGAAAGCGCACATTTGGCATCAGGACGGCCACTTGCGAATAAGCGCTGCTTCTCTTTGTCCGCCCTGCCCGTGCTCATTGCACTGGTCACCTCCCTACCAGAGTTGCTACAACTGGCTGGTGAGTCATGAGGAGTTCCTGAACTCGTCGTCCGGGAAGAGATGGCTTTTATCGAAAGGTGCGCAGGAATTTTTACGCTCTAGTTGGTTCTATGTATGGGCCTACCTTACTGACTGGTACGCAAAAATGCTAACCTCCACAAACCCTTCGCTTGAGAAATTAGCTCAGAAGCTGCTCAAAGAAAGACCAGTCTTCTTTCTCCAGTCCCGGGACTTTCGGCTTCTAACTGAATCTATAGAGGGACAACCAGCCAGGTCCTCCACCTCCATGGGCACGGGCACTTCGACAGTTGCGCAGTGGCGAGCCACCAATAAAGAGCTGGAGATCCAGCTGGGCAGAACTCTTAAGGATTGGACCGAAGCCGACGTGCTAGCTTGGCTCGTCCTCTTACCTCTC
ACGTACTCGCACCCTTCCCTTAATCCCTATACCCTGAAAATCGCAAAAGCCCTGGATAAAGACGTTATACCCCTGTTCAAGAACTTAAAAGTCACCGGAGAGTTTTTGAGCGACGATACCAAGGTTCAAGGTCTAGAAGAGGAAATCACTCGACGATGGCATCCTGATGAGGTGGTGCGGTCCGGCATTCTACACGGTTATGCTTTATGGGAGGAGAACTGGTTCTTGGAACGCCTTAGGGTGTTGCGAAACTCGGACACGGGATAAGCAGTGTGTGTTGAAACGAGCCGAGCGGCAGCAGATCCATCGAGAAATACCACAACAACCAACAGGCAAGTTTACTGAAAAGCCATACTTAAAAAAAAAAGGAACATAGACGAGAAGATAATCTAGCAAAAAATAATGCTTATTGAGAAAGATGCCTCTCTGCCGCAGCGACGTTACTCGCCGCTACAAGGGAAATGAGCCTTCGCCTTCAAAAAAAGGAAAAAACTATCGCAACTTGTAGCCAAAGTTTTTTCCTTTTTTTGTTGCCCGCCCGTTTCATGGACGCTCCATTGCATCCCCGAACGACAAGTCGCGGAGGGTCTTCCTTGAACCTTGCGCTCAGACATTTTGATTCGGAATTGACGACTGATCAGCACAGAAAACTTATAGTGGGTTACTCAAAACAACATGAAGTTCACAGTAACGATGGTCGATAGAGAAATAAAGCAAACAAAAT
>k141_22692969
GATTCAACCCGGACAAGTATATTTCCAAGTCAGGAAAAAAGAAAAATCCGTGTATGGCCACCCGTCTGGAAAAATTAAAATTGGAAATCGAAAAACAAATAGCACGCGTTGAAAAGGGTGAAAATGTTGAATTAGTAGAGAGGATATACATGTATTACGGAAGATACACTTAATAGGCTTTCAGAGGCTGACGAAGGCTACGGTGGTAAGTAGAACCTGAATTACGACTGGAGACGAGCCAAGTTCACTGTTACTCCCCCGGCCGTAGTGTCGGGGTACGACGGGATTAAAATCGTTTCCATTCCGTTTCCAATTCAAAAGAGATGGCAAATCGCCTGCTTCCGAAGAGATCGAGAAGCGTGTTTGTTTAACAAACACTGTTTAACTCCTCCTTTTTTTTTGAAAACATCCAGTTTAATTCGTCTGTTTGATCTGATGTGTTTTATTTTTCTCTCCCGAAAAGACAATGAGTACCCCCGCTGCCGTTACCGGAATGTTTGTAGACTTCTCTTCTGCGGCCTGGTGTGACTGCCTTGAAGAGCCGATTTCAGGTATGCATAGTCGATACGTGTGCATTACCGGGGTGAACTATATGACTAACACAATATTCAAAATTAAGACCATAGATGTTGACTCGCAAGTTTATGACCTTTATCAGGCCGCATATCTGGACCTACGCAAAGCCGACGGAAGCTCTGAACTTCTAGACATGGGTTTACCGCTTCCCAATTCTCCGGTTCCCCTTCGAGGAAACTTTTCGACCTTATTCGGGCTCCCGGGCAACTCTTACGGGCAGGTCGTTCCTGCAGTGCAATTCAACGTTGGAGATCTTCTTGTAGGGTACGTCCCTGTAGAATTTCTGGAACGGAGTTTGGGCCCAGTTTTGCGTTCCCCATGCACGCAGTTCCTACGGCGACCGTGAACAAACTACTATTTTCAAAAATTTGTGTTTGAGATTTTACATATTCTAGACGTGTAACAGCGTCTCCTGAGGAGAAAACACCATGCAGAAAAAAAACAGAGAAAAAGAGCGGTGCCAACAGAACACAACTCCGGGGAAGGACTTGACGACAGGAAACTCCGAAAAAAGATGTACCCATTCTGATCTTTCGGGTTTCTAACACACTCTGCACGAGCAAAGCACGTTACCCTGCGGGCCTCCGCTACGCGGAGTGAACGCTGGAAAGGGAAGAAAAGTCAAATCCCGTAAATCCCGTAAAGTACGGGAAGGCACCGATGTATTTTTATTTTTCCGTTTTGTTGTGTATTTCGAATTAGACACGTATACGGCTTCGCGGTACGTCGGGTTCTTCATCGTCATTAACATCGTACCGTACCCTATACGTGCCGCTGGCGTCGACGGCTACTATTTGTGCGGGCCACCATTCCGCGACATATATTTTCTCCTCGGACCCTGGTGGCGCGTACCGTCCGCTCCACCACGCTTCTACTTTATCACCGACTTTAAACCCCGTAGTCAACCCTTTCTGGATCCGTGACCCTTTCTGGATCGTTTTTCGTGGCCCTTTCGTTGGTGCTCGCCCTTTCTGTCGTTGCACGGACTGTGGGGACTGTTGGGCTCCTGTGTAATACGTGACTTTTCCCGGGAAGTCCGGGTCTTCGGAGATGATTAATGCCCCTTGCAGGGGCGTGACTCTCACGGCCTCGTCGACGACTAAATACTTTACACGGTCCGACTCACGTGCCTTATAGTACAGTGTATTTAATATTGAGGCACGAACGGCAGCGTCAACCGAACAAAGAAGACGAATCCTGAGTACTTCCTCTTTCCTATCCGTATCATAAAGAACAAGACCACAGATTTCTCTTCTTCGAATTGGATTAGAGTCAGGGTTAGGAAATAATCCCGTTGGAGCCCAGAACTTCTCTTCCACCCCCAGCCCCTGGAAGGGGCCTGACCCTGTGAACCGCCCCTTTATTTCTGCGCTGTCGGTATGCATATCGGGATCAATTGCGCTCTGGGTACGCATTTCGGGAAC
AGGCTCAGGAAGCGAGGGCTTGGTAGCGATCGCGAAAAACTGGCCGTTTTCCTTCGTTGGCTGCAGCAGCGTCTTCACTGTCTCACAGTCTAATTGTTGCCGGCACTGTCCGCAGACGACTTGATTAAGTTCTTCGTTGATGCGCCACTCTTTCAAGGTGGCGGGCAACTCCCGGAGAATACGGGGATCTTCGAGTATACATGACATTTTATCTTATCTTATCTTCAGTGCTTATTTTTTTCCTGTAAACTGGGAGTTGCTGGGCATTCCCATACTTTTGCAGCACTAGCTGGCCGATACAAAGTGGAACGCAAAAATTACGTAAAGACAAAAAGACGAATTCTGACATTTTGTTTCGGAATAAAGTATTTTTATTTTTATTCGTAAACCTTTTTGAATAATAAATTTCGCATTTAACTTTTTTGATTTTTTTGTAACAAGTACGACCGAAGGTACCACACCGCCGACGAGTGGGAGGAGCATACAAATGAACTTCAGGAGTCAAGTTACCCGTCGGAAGGTAGTGCTTGTGCTTACGCTTACCCACGTATTCAGCGATCAGCTCAGCCCTGTCAGGATCGGCATCGGCAAACCTCTGGACCCTATATAAAACACTTTTAAATGAACTTTAAACATGAAATTAAAAGTTCGAAGTTTCGTATTGTTTGGACACCTAAATATAGAAAAAATTTTCTTCGTGCTCTAAGTAAAAAAGGCCATCTATGGTCCGAACACTTTCGCAGAGCGGCTGCCTCCTTGAAAGGCCTCTTAAAAAAGGGTGGCTAGGGGTGCATGGACGCATCGACGAGCAGTGTGCCAGAAACTTCTGGTTAAGTCATGTGCGCAGATCATACCGGTCGTACAGAAATTTGGACCCGGAAAAAGCGTTGGAAAAAGCCCTAAAGGACAACGAATCGACGCTCTCCGCCCGGTGCGAGGGAACCTTGGACCTGGAGAAACTGTCCTGCAGCGGATATTGGTGGGACTATAAATCGAGTCCCTCATCCCGAGCAGTTTCTACGTCGCCGCAGTGGTCAGATTCCGAATGGAAAGATATCCAAAAATTCGCTCCGGGGGACCTTCTTGAACGTCTCGGAGGGGCAACAGAAACCGCGCGTTTGAAAGCAGAGGAAGAACAAAGAGCGCGACAAGCCGAAGCCGCGCGTTTGAAAGCAGAGTTCAAATTATCGGGAGGAGAAGAATACCGAAAATATGATGCTTGGCAATTCCTGGAGGCCACGGAAAAAGAAGAACGGAAACGCAAAGAAGAAGCAGACCCAAAACGCAAAGAAGAACAGAAACGTGAGGAAGAAGAACGGAAACGTGAGGAAGAAGAACGGGAGACCCGCTCCTGGCGAAAACGTTTGGCTCTGGAATACCTCTCGCCCGAAGTTCTGAAGAACGTGCTAGAGCACCTCCCGAAAGACGACCCATTGCACCCAAAAGTGTCCGCCCGACTTAAGTCGGCAGAAGAAGACGTGTCTGCCCGACTTAAGTCGGCAGAAAGGACTCAGTCGTTCCAAGAGCGATTGCGAACTTGTGAAGCCCTTTACTCGGAGTACGCACAAACAGCCGCTTCTCCGACGTCGTCCGCTCAACAAAGAAAAAAACTCGACGAACAACTGAGTACCTTCGAGTGCGCAGAGATAGCAGAAAGAGGGTGGGAGGAGAAGTGCTGGGAGTGCCAACGGGCACCTGCACCTGGAGAGTGCAGCCAGTTTCCTCCGAACTGTCAGGTGGTGACGACTCGACCGCTACGGTGGGCAGACGCCGTTTTTTTCCAAAAGGGAATAGAATGCGACTCGAAAGCGAGACTCACTAATACTCAACAGGAATTTTACGTAAAGCATCTTCGCCGGATCGTCAAAGGTACGGACGGAAACGTGGAGGACGCCCTTAAGCACTTCCGTCCCGATGTTTGCCTCTTGGGTAAGGGACGCAATGCGGACTACATCTTCATTTTAGCTTTATTTTTCACTTACGCTATTCTCCCTTCGCTG
AAGGCCCGCCAGCTGTGCCTAATTCCCACTATTGAATTCGAAGTGTTCTCGGACAGCGAAATCGAAAAAGTTTCACCTCAACTGTTAGGTGATGAGCGCAAAGCTTTCGTGGCTTCCTGTGTGACGAGAGCCCAAGAAAAAATCAATCGATTTTTCTATCCCGTACTTCCTAAGTGGCACTGGGCGCGTTCGACGACCAAGTGCCTCAGCTTCGAGCTTCAAACACTTATCGGTTATCGTAAGGATCTCGGCTTCTACTTCACTGAACCAAAAAGGGACGTCATCGTGGTACCTCGGCTGTTCCTCCATATGCGGAGGCAAAACGGCAAATGGACGTCGGAGCCTCACGATTCTAGCTGGTTCGCGCGATGGGAAGCCAAGCGAAAGCTCGAACAGGAAATTGTGGACTCTCTGAAACCGCTAGCGCAAAAGAGTTTTAAGTCTCTCTTTGAAGAAACTAAATCTCTATTTCCTTCTTTTGCCCGTTACAACCTCGTAGTGATGCTCCCCAGACATGCGACCTCGGCCATTTTTCTGCCGGACAAGAGCGAGATTTGGTACGTCGATACCGGCACCCGGGCGTTTTGCGGACACGCCTTAGGCCCGGCCACAATCGTCCGGGACGTCCTGCAGGTTTCTCCTGAATTCCGGTTTTTCACTCCAGGGTGCAGCCACCTCGCCTACAGGCAACGAGGCCCTTCATGCGCTCTCCATTCTTTGTTCTTTTTCTGGTACGGGTGCTGTAACGACTTGGCCGCGGTGCAGAGATTGTGGAATTTTCAGTCGCTGTCAGACTGGAGAAAACACAAGGAGTTAGTGAACGCGCATCGCCGGGCTCCGTTTTTGTGCCCA
>k141_9211005
GAAACGCACCCCTCCGTGGCAGAACTCAGTCCCCTGAACATGAGAGGATTGAGGCTCGTAGTGGTGGGACATGCGGTAATGGTGATCTTCACTCCGGAGAGCTTCTCCGAACAGGTTTCGAGCTTCTTTTTAAAGTTCCGAGCTGCCTCCGTGATCTCTTTTTCCTGGCGATCCATTTTCAAAAAAAAAATAAAATTGAGTTCAAAATTTGGAGAATAAAAAATTAAAAATAAAATTAGTGTTTGATAAAAATGGGCTATACCGCTATAGACATATAGCCGATAAGCACGTCGCCAGTCCGTTGACCGTTGGTCGGTCACAAAACTGACGAAAAAAACGCCACCGTTGACTGTCCGCCGTTGACCGTCCATCGGCCAAAAAAAAATATAAAAATATAAAAATACCTTCTTTAATCCCTCTTTTCGATAAAACAAAAAACCTTTTTTTTATTCTCCAAGGTTTGGCATTCTAGTTTTTGTCTTTTGACCAATGGCCGGAAGAGGAAAGACGACACCGATAGCTCAAAAAACGATGGCTCAAAAAACGATGGCTCAAAAAAGTCAAAAAGCGGAACTCCAACCAGTTCCGGAGACCGATGAAAAAACGAGGGGCACGGACGATGCGAACAAGCACGTGGATGAACTTTTGAGGGTGCTTAGTGCTTTGGAGTGCGGTCATTCCATCCTCATCTTGCAGGATACGGATTCCAAGTTGATCCGAAAGTGGGGTGGCAAGGATCTCTGTGCGATTATGGAAAGTCGCTGTGTGAATAAATTCGCATGGACGCCCACGGCCGATAAAGTGGAGAATACCTGCGTGACCGGGACTTGGGCAACCTTTCTCAAAAACGGCAGCCATCCAATCTGCCGTTTCAGTGTGACAGATGAAAACACGGTGGAAAAACAAGTGGAACACATTCGAAACGTACACGAGCCCATTTACAAGCTGCTGAAGGCGATCCCCTTGGTGGCATGGATAAAAACGCGAAGCAACGATGCCGAAAAAAAGTACGGCCTTTCGTGGGATGTAGGGATTGGAAAGGGATTGGCCATCAATACGAGTAAAGACAAAAAAAAGAAAGGAAGATACTGGAAAAAGATTACCGACGGAAAGTGCGTTGAACTGCCGCTTGAAGGCTTCAAAACTGCGATCTTGCACCTGTGGAAAACAACCGCAAAACCCGCGTTCAAGAAAGGGCCGAATCCCCCCAATGCGGAAGACCTTGGTAACTTCCTCCAAACCACCACTCACGTGACCGTAGTGCTTAAAAATTGGTCTCGGGACAGCCCCACAATGGAAGAACTTCTCGCTATCATGGACCGCGAGGGGGAAAAGCTCCAGGACAAGCAATGGGTAACTCAAAAATCTGCTATTAATTTAAAAAAAATGTAATTCTAACAGTTATAGATAGTTTCATTGAATTGTATAATTCCTTTGTCGTTGCTTTTACTTTGTTTGTTGACATTCATACCGATTGCGTGTTAATGTGGTTTTTTATAGAATAAAAAACCTAAAAAACGGCAGAAGAGGACGAGGCCGGTACAAAAAAATCCGACAGCGACATCGGAAACAGAGACAAAAAAACAACGGCTCAAAAAAAAATAGAGGAAAAAAATTTGGGGTGTATTTTAACAATTATTCAATAAAAACTCCGAAATGGAACCGTTAATTACTGGAATTTGTAAACAATGAATCTTGGTACTCCCTAATTTTGGATATGGTCAGAGAAGAGAAAAAAAACAATAATTGTGAAGAAAATATTAAAATATTAACGGTTTTTTATTAGAAACATAATTTTAAATCCTCAATCCTTTTGTGATGTCTCAACGCGGACTGTTTGTCCTCCGATTTCGCGGGCAACGAGTCTTTGTCCTCGGGCAACCTCCTCCGCCACTTCCGTCCAAGGCATAGGGTGGTTTGTACTCGGTGTGTACTCTATTTTTCTTAACTTCGTCTTCTTCTTCTTCTTCTTCGGTTTGGTGTCCTCCCGATCGGGGTCC
CGATTCCGCTTCTTCGTCTTCGTCTTCGTCTTCGTCTTCTTCTTCTTCTTCTTCTTCGTTCCGTTGTCTTCGACCATGACCTCGTCAATTTTGTCTTCCTCGGGAGAAAAGTGTAAGTACTCCAATTGCTTCTGCGATATTTTTCCCCCGTAATGGTTCCTAAGAAGCGCACGCATGTGCCTTTTCGTTCCATACACGGATAATCCGAGGGGTAAGCTCAATGTAGAGTTCGAACTACTGGCCCGGGTCGGACCGGATTCGATGTTCTTTGCCGCGTCCATCTCGGCGATGTCCAGAACGTGGATGCCTTGGGTCTTGGTGATGTATGAACGGCCATCTTCTGCTACTGTGTAGCTCATTTTGCCCATTCTGCCCTTGTTGGCGTCCCATGCCACGGTCTTTTTGTTTTTTGTGTAGCGTTCCTCAACTGTTTGCATCGCCTCTTCCTGCACAGCAGCCCTTTCTTCGTCGGTTGTGCAGCCTGCAGCCGGAAGCATTCGTTGGACCTGTGCGTTCACGACTTTTTTAGTGAACTCTGAAGCCCGAGCTCCGAAGTCTTTAAACTGCATCACTGTGGCATCTTCTTTGGTTATTTCTTTGTGCTCAGTGCTCATGTGCTCCTCTGCTTTTTGCTTCGTCGAGAAGGTCTTTTTTTTATAAGTTGCTCGGTTTTTGCATATTTTCAAAGTTGTTTTCAAAGTTGTTTTCAAAAGTTTACCTTATTTTCGCAGCCGCAAGAGACTCCCGCCCAAACTTCAGTTGCATACTTCATTTTGATCATGATGGCCGTCTTAGGTTCCGAAGAACCAATGTACGAATCGAGTAAAGTAGTGGTCTTCTTCTCCTTGCCTGTGTGTCCGCACGCGCACTTCACCGTCCAGGTGTTGCCTGTTTTCACGCACGCCGGTGAGGCGTGGAGCTTTTCCAGGAGATGCGGGACACATTTAAAAATAAATGTCCGTGTCCCTAGATGGACACATAACGGATCCTTCAGCTGAAACATCTTCAAAAAGTCGTCGGGATTGAAGGCTGGATACAGCATGTAGCACCAGGGCAAGAAGTCCCCGGGTCCGGCGGCGTTCACCGAGGGTCCATTGTGCTCAACGCATCCGCTTTAGGTTTTTAAAATGAAAAGACTTGTTGAAAAGGCGTAGGAACAAAAATTTACAAAGTTTACAAATTTTACGTACCAAATGGAAACGCACCGGAACGTAGCAGAACTAAGTCCCGTGAATAGGAGAGGATCGAGGGTCTTGGTGGTTGGACATTCCGTAATGGCGATCTCAACTCCTGAGAGCTTCTCCGGACATTTTTCGAGCTTCTTCCGACAGTTCTGAACTGCCTCCTTGTACTCTTTTTGCTGGTCGGGATCCCCCTTTGTCCCCTTATCCTCCAAATCACCAAGTGCACTCAAAAATTCATGCACGCGACGGGAAGTGTCGTTCATTTTTTCATCGGTATGGGTCTCCGGAACTGGTTGGGGTACCGCTTTTTGACTTTTTTGAGCCAACCTTTTTTGCGCCATCGGGAAGTTCTTTCCTCTTCCGGCCATTGGTCAAAAGACAAAAACTAGAATTAAATTGGTGTTCAAAACCTTGGAGAATAAAAAAGAGATTTTTTGTTTTATCGAAAAGAGGGATTAAAGAAGGTATTTTTATATTTTATATTTGTGTATTTTTATATTTTGTTTTTGGCCGATAGACGGTCAACGGCGGACAGTCAACGGTGGCGTTTTTTTCCGTCAGTTTTTTAATCGACACCGGCCAACGGACGGTCAACGGACTGGCGACGTGCTTATCGGCTATATGGCTATAGCGGTAAATATAGTGGTATAGCCCATTTTTATCAAACACTAATTTTATTTTTAATTTTTTATTCTCCAAGTTTTGAACATCAATTTTAATTTTTTTTTTGAAAATGGATCCCCATTGGATCCAAGAGTGCGCTGGCAAGGATCTCTGTGCAATGATGGAAAGCCGCACCCTTTACGATAACGATTT
GGGAGCGGTCACACGCGGTCGACTGCACCCTGGGTGCAGTCGAAAAAGTGAAGAACTCTTGAATACCGAAAGTCGGTATATCTGCTTGACCGGGACTTGGGCAACCTTTCTCAAAAACGGCAGCCATCCAATCTGTCGTGGTTTGACAGATGAAAACACGGTCCAAAAACAAGAGGAACATATCCACAAGGCACACGAGCCCATTTACAAGCTGCTGAAGGTGATCCCCTTGGTGGCATGGATAAAAACGCGAAGCAACGACGCTGAAAAAAAGTACGGCCTTTTGTGGGATGTTGGGAATGGAACCGGCCGCGCCATCAATACGAGTAAAGATAAAAAAAAGAAAGGAAAGTACTGGCCACAGATTACCAACGGAAAGTACGTTGAACTGCCGCTTGAAGGCTTCAAAACTGCGATCTGGCACCTGTGGGAAACAACCGCAGAACGCGGGGTCAAGAAAGGGTTGAATCCCCCCAATGTGGAAGACCTTCGTAACTTCCTCCAAACCAACACTCACGTGACCGTAGTGCTTAAAAATTGGTCTCGGGACAGCGCCACAATGGAAGAACTTCTTGCTATCTTGAATCGCGAGGGGGAACAGTTCAAGGACAAGCAATGGGTAACTCAAAAATAATTTAAAAAAAAATGTAATCTCTCCTAATTATGAATTAAAATCCCTGTTGTAAACGTAGGTTCATTGAATTGTGTAATTCCTTTGTCGTTGCTTTTACTTTGTTGACATTCCTACCGATTGCATGTTAATGTGGTTTTTTATAGAATCAAAAAAAACGGCCGAATAAGAGGACGAGGCAGGTAGCGACATCGGAAGAGGGAAAAGGAAAAAAACAACGCCAAAAATAGAGGAAAAAAAAGTTTTGGGGAAGGGGGAGTATTTAACAATTATTCAATAAAAACTCTGAAATGGAACCATTAATTAATTACTGGAATTTGTAAACAATGAATCTATTATTGTCTTATCTTGCTCAATATTCACGAGACGATAATACTTGGTCGCAGTAGTATTTTATGCAGGATAACTTTGTGTTCTAGTTCGAGCACCGTAGTTTGCTGACC
>k141_20768437
ATTCAAGCTGAAACCGCCGGAGGAAATATCGACGCTAAGTCTACCATGTTCTTTTAATATAAAAAACTCCACCCACTCGTCGTTGGTTTCGTGCTCCTCCAATAAACGTTAAAACAAAATGATTGCACCTAACAGACACGCAGCGTCATGGTGAAGCGCGTTCTAATAAAGCCGATTAGACCGCTCCCCCCTCCGTGGAGTGCCCTGGAGGCAATCGTTGTGATCAATTTGCTCGATCGCGAAGATCGACTCGCGCACGTGCAGAAAGAACTAAAACTACACGGCTTGGACGGAGCCGCTTATATTCTTCGCAGCCAAAGAGAACAAAATGATTTTCTTAAAGGATGTTACGATTCCCATAGATATGCCACTACCCTTGGTCTTATCAAGGAATGGAACCGTGTACTAATATTGGAAGACGATTTTGTCTTAGATCAGAATTGCGGTCTCCGTATCGCAGAAAATGTGAACGTACTTCCGAAGAATTGGATGCGGTTGTTGGTTGGATATATTCCGATTGCGCCGTACTACGATTTTTCCTCGCAACTTTGGAAAGGCCTAACGTTGTGTTGTACCGGGTACGTTATCTCGAAGCAATATATGGATTGGATGCCCGTTTGGGAAGACGTGAGTACTATTTGTAAACCGTTTACCGTAAATTGTGTGAGTTTGAAAAAAGCCGACAATGGACTCGACCATGTTATGACTTATTTGACTCGCCGACGCACATATTTGGTTTTCCCGGCTGTCGTTTATGTCAACGGAAAGCTGAAGTCGGACCACACAGGCAATTTTTGGGACAGGTCATGTCACTCCCTGTCCAAGCAGAAGTTTTTACAATTTCTTTGGTTGATTGTGTATATCTTCGTATTGATTTCCGGGATAGTAATAATTGTACTTCTCAGGCGGGTTACCGCTTGACCGCCATACGAAACGTATTTTCGCACGGGAAGGCTCGGACGAGCGAAATCTTCCAAAATAATAAAAGTGAAAGTGAAATTGAAGTTGTGGTCATCTTGGTCTCGCATAGTACTTCAAATAAAGTAAAGTATGATCGGTGCGGTTCAGACGTACTGTAACCTTTTAGACCCGAATGTACAGGAAAGCGTTCTGGTCACTGCTACGGAAATTTTGGTTGCTGGAACGGTCCGGTCTTCCGACGCGCGCCTGCTCGCGGCAGTGTCCATTTTTTACTCGAGCCGTTACCACGACAGGTATCGAACTTTGGATATGATTGTTTCCGAGATGGCTGAGACTCCAGCAGACCACAAAACAACAACGAAAAGGGTTCGGAAACTTTTAGCAAGACTCGAAAAATTGTGCACCACCTCCCCGTTCCGGCTTACTTGTCGTACTCGACCTTCTTACTGGAAACCCCTCAGCGTGGACCTGGTCACGCATGCCTGCGTTAGACTCGGGTGGTCGGCTTCTGTTCGGAAAGCGGCTATCACGGTGTGCATGGAACTCCATCGATCTGACTTTGGCTTCTCTCTCTTACCTGTGTCAGCTGCAGCAGGCGTTCTGTATCTGATTGCGCTGCAGTTCGTCCCAGACACAGGTGCACGGGAGATTGCGTTCGTTCTTTGTGTTCCTACCGCGACAATAAAAGTTGTACATCGCCAACTCAAAATTCGAAGAGTGGCTCCCCCTATTTTGCTTTAAAATAGGGGATTTGACTTTCTGGAGTTCGCTTCTACGTGCTGTACTTCCGAAAAGGTCGAAAAAGATAAAAGATAAACAGGGCTTTCCCGTTTCGCTGAAAGTGGAAACATTTCTCAAAAAAACAAAGTAAGAGATGTTTTTTTTCAAGCAAACAGAAGTTGACATTTCAATTTATTTGCATTTCGATTTTCGTTCTTCCTTGAGACACCTTTCAAATCTCAAATTTCTTTTTTTTTGCGAAGACAAAAAGAAAAATTAAACTTCAAAAAAAAAACGAAACGCTTACGATGCACACCCCCGTTTTCGAGTCGGAGAAGCAATGTGTAACTTTACCGGAC
GGAACCGTGCTGTCGGACGCTTTCGTTGTTGAAGATTTCATGCAGAAATTTCTCTCCCTTGATCCCCGTCGTATTTTTCAAGACCTGCTGGAATTCGAGTGTCACGACACAAAAGTCGAGATCCCTTTTTCTCCTGAAGGAGACTCGGTGAAATGGGTGACGGGAGAGCATCCGGCTTTACACTATCGAGGGAACGCTCTGAAAAGGCGTAAAATGTGGTATCTCGGGATCTATCGTGGCTCGTTGCATAAAGACGGTGGTTCCTTGGCCAGTTGTTTGGGAGAACATCGCGAATGCGAAAAAAACTAAAGCGCAACGTTCGAAAAGAAAAATCGAAAGAGAGGAACAGGAAGAAAACGAAAAGAAAAAGCGTCGAAACTCGAAAAGCAAAAGCATGTATTAACATTACACATTTTGCTCACAAATGGTTTCTCTCCGGTGTTTATTAATTTGGAATACTAATTTTTCATTTTCTCTGCCCCTGACATTCATGCTTGGGGGCGACTGAGTTGAGACCGATTTCGGTCTCTTTTTTTTTGTCCGCATCCAATCTCATGTTTTTTTTCTTTTGGATTTTCTTTTTTGCTGGTTTCTTCGCCGCGGGCTTTTTGGACGAGGCCGTACTGGCAGAAATAGTAACTGTTTTAACGGGAGCACGAATCTGTGGACGAACGTGAACTCTTTTTCCGTTCGTGGAGCGAAAGTGCCCGCGGGCTTTATTTTTTTTCGTAGTGGTTTCGCGGCCAGTGAATTGTTTCGCACATGTCTCCCAGTCCGTTGTGCCCATGCTCGCGTTACATAGAGCACAGATTGCTCTCCCGGTTTTCCCCTCCTTTACATTTTGCGAGACGGTGCCCAACGTGAAAGTCCCAACGTGAAAGTCCCACACGGTAACCTCTGTCGAGCGACAGCACGGGCATTTTGAGCCGAACTCGTTGGACCATACCTTTCGGCGCAAGGCTTTAGAAATAGTCATTCTTTTGTGGTGGTCATATACAGGTGCGTACCCTACTCATTTTTGTTGTAGAAGACAATAAAGCGCTCTTTTTCTTTTATTTTTTATTTTTTCTTCGTGCTCTACGTAAAAAAAGCGTATATGGTCCGAACACTTTCGCAGAGCGGATGCCCCCTTGAAAGTCCTCTTAAAAAAGGATGGCTAGGGCTGCATGGACGCATCGACGAGCAGTGTGCCAGAAACTTCTGGTTAAGCCGTGTGCGCAGATCGTACCGGTCGTACAGAGATTTGGACCACCCGGAAAAAGCGCTGGAAAAAGCCCTAAAGGACAACGAATCGACGCTCTCCGCCCGGTGCGAGGGAACCTTGGACCTGGAGAAACTGTCCTGCAGCGGATATTGGTGGGACTATAAATGGAGTCCCTCATCCGGAGCAGTTTCTACGTCGCCGCAGTGGTCAGTTTCCGAATGGAAAGATATCCAAAAATTCGCCCCGGAGGGCCTTCTTGAACGTCTCGGAGCCGCGCGTTTGAAAGCAGAGGAAGAACAGAGAGTGCGACAAAAGGCCGAAGCCGAGCGTTTGATAGCAGAGGAAGAACAGAGAGTGCGACAAAAGGCCAAAGCCGCGCGTTTGAAAGCAGAGGAAGAACAGAGAGTGCGACAAGAGGCCAAAGCCGCGCGTTTGATAGCAGAGGAAGAACAGAGAGTACGACAAGAGGCCAAAGCCGCGCGTTTGAAAGCAGAGGAAGAACAGAGAGTGCGACAAGAGGCCGAAGCCGAGCGTTTGATAGCACTGTTCAAATTATCGGGAGGAGAAGAATCAATTCCTGGAGAAAAATATTGGGCTTGGCAATTCCTGGAGGCCACGGAAAAAGAAGACCCAAAACGCAAAGAAGAAGCAGACCCAAAACGCAAAGAAGAACGGGAACGTGAGGAAGAAAAACGGAAGCTCCGCTCCATCCGAAAAAGTCTGGCTCTGGAATACCTCTCGCCCGAAGTTCTGAAGAACGTGCTAGAGCGCCTCCCGAGAGACGACCCGTTGCACA
CAAAAGTGTCTGCCCGACTTAAGTCGGCAGAAGAAGACGCTCGCAGAACTCAGTCGTTCCAAGAGCGATTGCGAACTTGTGAATTCCTTTACTCGGAGTACGCAAAAACAGCCGCTTCTCCGACGTCACCTCAACAAAGAAAAAAACTCGACGAACGACTGAGTACCTTGCAGTGCGCAGAGATAGCAGAAACAGCGTGGAGGGACAAGTGCCGGGAGTGCCACCGGGCACCTGTACCTGCACCTGGAGAGTGCGACCAGTTTCCTCGGAACTGTAGGGTGCCGACGACGGCTCGACCGCTACGGTGGGCAGACGCCGATATTTTCCAAAAGGTAATAGAATGCGACTCGAAAGCGAGACTCACTAATACTCAACAGGAATTTTACGTAAAGCATCTCCGCCAGCTCGTCAAAGGTACGGCCGAAAACGTGGAGGACGCCCTTAAGAAATTCCGTCCCGATGTTTGCCTCTTGGGTCAGGGACGACAGGAAATCTACATCTTCATTTTAGCTTTATTTTTCACTTACGCTATTCTCCCTTCGCTGAAGGCCCGCCAGCTGTGTCTCATTCCCACTGTTGAGTTCGAAGTCTTCTCGGACAGCGAAATCGAAAACGTTTCACCTCAACTGTTAGATGAAGATCGCAAAGCTCTCGTGGCTTCCTGTGTGACGAGAGCCCAAGAAAAAATCAATCGATTTTTATATCCCGTACTTCCTAAGTGGAACTGGGCGAGTTCGATGACCAATTGCCTCAGCTTCGAGCTTCAAATACTTCTCGATTATGATAATGATCTCGACTTCTACTTCACTGAACAAAGGGACGTCATCGTGGTACCTCGGCTGTTCCTATATATCCGGAAGCAAAACGACAAATGGACGCCGGAGCCTCACGATTCTAGCTGGTTCGCGCGATGGGAAGCCAAGCGAAAGCTCGAACAGGAAATCGTGGACTCTCTGAAACCGTTAGCGCAAAAGAGTTTTAAGTCTCTCTTTGAAGAAACTAAATCTCTATTTCCTTCTTTTGCTCGTTACAACCTCGTAGTGATGCTCCCCAGACATGCGACCTCGGCCATTTTTCTGCCGGACAAGAGCGAGATTTGGTACGTCGATACCGGCACCCGGTCGTCGTGCGGACAGGCCTTAGGCCCGGCCACAATTGTCCGGAACGTCCTGCAGGTTTCTCCTGAATTCCGTTTTTTCGCTCCAGGGTGCAGCCACCTCGCCTACAGGCAACGAGGCCCTTCATGCGCTCTTCATTCTTTGTTCTTTTTCTGGTACGGGTGCTGTAACGACTTGGCCGCGGTGCAGAGATTGTGGAATTTTCAGTCGCTGTCAGACTGGAGAAAACACAAGGAGTTAGTGAACGCGCATCGCCGGGCTCCGTTTTTGTGCCCA
>k141_7723568
TCGGAACTTTAAAAAGAAGCTCGAAACCTGTTCGGAGAAGCTCTCCGGAGTGAAGATCACCATTACCGCATGTCCCACCACTACGAGCCTCAATCCTCTCATGTTCAGGGGACTGAGTTCTGCCACGGAGGGGTGCGTTTCCATTTGGTACGTAAAATTTGTAAACTTTGTGTAATTCCTTCTGTTCCAAATTTCCCTACGCCTTTTCAACAAGTCTTTTCATTTTAAAAACCTAAAGCGGATGCGTTGAAAACAATGGGCCCTCAGTGGTCTCCGTCCGACCCGGGGACTTCTTACTCTGGTGCTACATACTGGATCCAGCCTTCAATTCCGACGTCAATTTTTTGAAGGTGTTTCAGCTGAAGGATCCGTTATGTGTACGTCTCGGGACACGGACATTCATTTTTAGATGTGTGGTGCATCTCCTGGAAAAGCTCCACGCCTCACCGGCGTGCGTGAAAGCAGGCAACACCTGGAAGGTAAAGTGCAACGACGGGTGTGCACACACAGGCAAGGAGAAGAAAACCAGTATTTTGATCGATTCGTACATTGCTTCGGAACCTAAGACGACCACGACCGAAAAGAAGTACGCAACTGAAGGTTGGGCTGGAGTCTCTTGCGACTGCGAAAATAAGGTAAACTTTTGAAAACAACTTTGAAAACAACTTTGAAAATATGCCAAAACCGAGCAACTTATAAAAAAAAGACCTTCTCGACGAAGCAAAAAGCAGAGGAGCACATGAGAACTGAGCACAAGGAAATAACCGTCCCAGAAGATGCCAAAGTGATGCAGTTTAAAGACTTCGGAGCTCGGGGGGCGAATTTCACTAAAAAAGTCTTGAACGCACAGGTCCAACGAATGCTTCCGGCTGCAGGCTGCACAACCGACAAAGAAAGGGCTGCTGTGCAGGAAAATGCGATGCAAACAGTTAAGGAACGCTACGCAAAAAACCAAAAGACCGTGGCATGGGACGTCAACAAGGGCAAAATTGGCAAAATGAACTACACAGAAGATGAAGATGGCCGTTCATACATTACCAAGAACCAAGACATCCACGTTCTGGACATCGCCGAGATGGACGCGGCAAAGAACCTCGAATCCGGTCCGACCCGGGCCAGTAGTTCGAGCCGGGCCAGTAGTTCGAGCTCTATGGTGAGCTTCCCCCTGGGATTTGTCGCGTATGGAACGGGAAGATCAATGAAGGCGGTGTTTAGGGAGTTTTGCGGGGGAAAAATATCGCAGAAGCAATTGGAGTACTTACACTTTCCTCCCGAGGAAGACAAAATTGACGAGGTCGAAGACATAGTCGAAGACAACGAAGACACCGAAGACACCGAAGACAACGGAACGAAGAAGAAGAAAACGAAGACGAAGACGAAGAAGCGGAATCGGGACCCCGATCGGGAGGACACCAAACCGAAGAAGAAGAAGAAGAAGAAGACGAAGTTAAAAAAAATGGAGTACACACCGAGTACCCACCACCCTATGCCTTGGACGGAAGCGGCGGAGGAGGATGCCCGAGGACAAAGACTCGTTGCCCGCGAAATCGGAGGACAAAAAGTCCGCGTTGAGACACCACAAAAGGATTGAGGATTTAAAATTATGTTTCTAATAAAAAACCGTTATTTGTATTTTCTTCACAATTAGTTTTTTTTCTCTTCTCTGACCATATCCAAAATTAGAGAGTTCCAAAACTTCTAAAAAAATGAATCGCTTGTTTTCCCGGAATAAACTCCTGAAAACTGAATTCAGAGTCTTCCCGTAGAACCAATACAATGAGCATGGACTCTGAGACCCCTATAAACGACGAAGCGGTATGGCACGTCGCAGAAGCGTTTTTCAAGAAGTTCGGTCTGGTGTATCACCAAAAGGAGAGTTTCAATTCTTTTTTTCTTCGCTCAATTCCTGACATAATTCACGACAACATGCCAATTACGTTCGGGAACGGCCGTTATGCAGTTGAAATGCAAAACCCTCTCTTCCATGCCCCGTGTGTCGA
AGGTGAAGGCACCGTTGTCTACCCGATGCAATGTATAGAAGCAAATCGTACTTACCGTTCTGAGCTTTCGGTAGACTTAATTGTACGGGACTTGGCAGACGGGTTAGAGAAGAGTCACCGGGCGGTTTCACTAGGGCTTTTTCCTGTAATGGTAGGGTCCGTCTTCTGTAACCTCGTGCAAAGAAATACAACAGAAAAACAGAAGTACGCTCTGAGAGAATGTCCGTATGACGAAGGAGGGTACTTCATCGTCAAAGGAACCTGTAAAGTCCTGGTCTGCCAAGATCGTCCCATGTCATGTTACAATCGCGTTTATGTGTTCAGATCACGTAAATCTCCGAACTATGCTTATTACGCGGAAGTCAGAAGCATCGCACCCGGCCGAGCCGGCCGAAGTACCACCGTAGTAGTGGGCCTTACAGAGAAGAAGAGCAACGTTCGACGTCTAACACTCTCAGCGGTCATTCCGTATATGTCGGACAAAACCCCGATTCCTCTCGGAGTCCTTTTCAAAGCGTTAGGAACTAAAGACGAACAGGAGATAGTACGAACGATTTTTACGAACGAAGAGCCGTCAGCAGCCGCGTTAGCCTTTCTGCGAGGAACTCTTGAGCAATCGTACGGCTGTGCGACCCGAGAAGAAGCCCTGACTCGTATTGGTAAAAACGGAAAACGTCACTTTTCGGCAAAAAAAAGTCCGGAAGGAGATCGTCCGCTAAGTTCAGCTCAGGCCGAAGCCACCATCCGCAATCAACTATTTTTGCATATTCAGCGCTCGACTGAGAAGAAAACTTGGGAAGCGAAACGCTTCTTCTTGGGGTACGTCGTCAAGCGGCTTATCAATGTAGCTTTAGGAGTCGAGAAACCTGACGACAAAGACCACTACGCGACGAAGCGCGCCACTACGCCAGGGATGTTGCTGGAGCGACAGTTTTCGCGGGACTTTCGCCGTCTTTGTAGCGACTTGGTAAAAGCGGGGGAAACGGCCTTGGAAAGGAAAAACACCATCGATGTCAAGACCTGGGTGAAGAATAAAGCGATGAATATTACTTCGTCGATGAACTATTGTATAACAATGGGAATGTTTGCTGGCAAGATGATTGGAGTCAGTCAGAATTACGATCGTTTCAACTTAATCGCTTCGGTGGCTAATGCGCGCAAGATTTCTACGCCCATCAACGAAAGCGGTAAAGTCGTTGGTCCGCGACAGCTGCATGGAAGTCACTGGGGTATATGCTGCCCGTACGCTACGCCGGAAGGAAAGAAAGCGGGTCTTCTTAAAGATCTAGCCCTTACTTGTCGAATCACGGTAGGTGAGAGTGCGGAAGGATTGAAGGAACTTTTACGTCTCGACCCGGAGCTAATAGACCTAATCATGCCGTCTCGAAGACACGGCCACAGCAAAGTCTTTGTCAACAGTGATTGGTGGGGGTGGACGCGGGATGGGTCGGCGATGGCGAAACGGTACCGAGCGCTACGCCGGAAAGCTGGACTGAGTCCCCTGACTGGCATCTCTTACCATCCGCTCAGAAACGAGGTGCGATTTTCAACGGACCCTGGACGGTTCTGTCGTCCGCTCTTCGTTGTTGAAAACGGACAACTCCTTTATTCGACGAAGCACCTTTCTATTGTTCAAACTGAGGGGTGGGATGCTATAATGGATGAGGGGATCGTGGAATTCGTGGACAAGGAAGAAGAAGAGTTTCTAGTGGTGCAGTATTCGCCTTCGTCACTCGCTCGTCTTCGAGCCGACGAACAACAGGTCGTAACGCACTGCGAGATTCATCCTTCCCTTATTTTTAGTGCCAGTGCGTCTGTGATACCGTTTCCTGATCGTAACCAGGCTCCTCGCAATTCTTACGCAGCTCAAATGTCCAAACAAGCTGTCGGGATTCCTGGCTTGAACTACTCATTTCTTGTGAAAGGTACCTACAACGTCCTAGACATTCCGCAACGGCCTCTCGTGGAAACGAAGGTTGCTTCGTTGCTGGGTTTCTCTA
ACCTTCCCGCTGGAGTGAATGTAGTGCTCGCTGTGTGTTCATTCATGGGATATAACCAAGAAGACTCTCTAGTTTTCAACCGGGCCTCTCTCGACCGGGGATTGTTTGGGATCACCCGCTTGCTGACCTTCTATGCAGAGGTAAAGAAAACCGAAGGAGAAGAGTTTGCTGTGCCTGAAAAGCTACAGGTTTCAGAGGGAACCATCGTCGGACGAACCGCTGCTGGGCAAAGACGGACGAGCAGCGACGTTACCCAACAGAAGGGGACGAAAGCTCTTTTGTTTCGCCCTTGCTGTAAAATCACGGGCAACGCTGCTAAACTAGATCCTCAGCTGTGCCACGTTCTTGCTCGAAAAACCATAGCTTGTCGGCTTTCAAAAGGCGGCTCAGTCGTTACACGAAGTACGCTCGTCGAAAAAGGAGATATTTTGATAGGGAGAATTACGAAAAATGCACCTGGAACAATTTACCCCGAACCTTATAGAGACGTAAGTATAGTTTATACAGAGACGCTTCCGGGCCATGTTCATCGAGCCGAACGTGGAGTGAACGCGTCTGGATACGAATTCATCCGGGTGGTGATCTCTCAGAAAAGAGGCGCCGCAATAGGAGATAAATTTGCTGCGATGCACGCCCAAAAAGGTACTCTCGGGAAAATCGTGGACCCGGAAGACCTGCCTTTCTGTGCTTCCGACGGCATCATTCCGGATGTGTGCATTAATCCTTTAGCTTTCCCTAGTCGGATGACAGTTGCAATGTTTGTGGAATCTTTAGTCGGGAAACAAGTTGCTTTGTCCCCCAAAGCCCGCAAAGTAGGAGCTCACGAACTTTTTATCGGAGATGGAACGCCTTTCGAACGGCTCGATCTTCAAGAAGTTGAAGCCGTCCTCACCAAAAATGGGTATCAATGCCGCGGAAAAGAGTTCATGATTGACGGAATGAGCGGTCGGCCGTTGCCTTGTAGAGTTTTTATCGGACCAGTTTACTATCAACGGCTCAAACACATGGTTGTAGATAAAATTCATGCCCGGGCTAGAGGGAGCCACACATCCATCACTCGCCAGCCAAAAGAAGGCCGACAATTTGGGGGAGGATTTCGAGTGGGCTACATGGAAAGAGATAATTTGGCTGGCCAGGGGTCGGCAGCTTTTCTTCGCGATCGTCTTTTGGAGAATTCGGACGACTACAAGATGTACTTTTGCTCAAAATGCGGTTTACCTGCCGTTATGTCACGGACAGGACAAGGCGAATGCACCCTGTGTAAATCTCGAGACGTGAAAAAGGTCAGGCTCCCGTACGCAACAAAATTACTTCTACAAGAGCTGAACGGAATGGGGGTGATGGTTCGCGTAGTGCCATCAACTTTTGGGACCGAACACCCTGAAATTGAACCTTACCAGGGACCTTCCTAAGCGCCCTTTGCAATGGGACTGAACTGCTGGAAAGTCGACAACGAGGCACCGGGGTTTTGCTCGACAATAAAAAAAAGAAAAAAAAGTTGCGGAGCTCAAAGCGCAGCTTGCGCCCATACGTCTACTCTCCGGTTACGAGATTTCGGTTGTTGAACAATAATGCTTTATTTTTACGAACAAACGTCCCTTCATTTAAAGAGAAGGACATTCATAGTAATGACCACATGCCTAAGGTGTGTGAGTACTTGAATTGCAGAAAACGTCCATCCTACGGTTATTTTTACGGAGAACCTAAAAGGTGTTCTACTCACGGGCTTCTTTGTAAGATGAAGCCTCAATATGCTATCTGCCGGTGCGGGAAAGCTCAGCCTATTTACAACGAACCGGGAAAAACACGAGCCGTGTGCTGTTCCCGATGCAAAACCGCATCGATGGTTAATGTCAAACACAAAAAATGCAAATGCGGGAAAGCTCAGCCTATTTACAACGAACCCGGACAAACACGACCCGTGTGCTGTTCCCGATGCAAAACCGGGTCGATGGTTGATGTCAAAAACAAAAAATGCAAATGCGGGAAAGCATTTCCTA
TTTACAACGAACCGGGAGAAACACGAGCCGTGTGCTGTTCCCGATGCAAAACCGAATCGATGGTTGATGTCAAAAACAAAAAATGCAGATGCGGGAAAGCTTGGCCTATTTTCAACGAACCGGGAAAAACACGAGGCGTGTGCTGTTCCCGATGCAGAACCGAGTCGATGGTTGATGTGGCAAACAAAACGTGTCCGGGTCAGGGACCGGGAATGTGCCCTACAGTAGGGAACCCAAAATACAAAGGCTATTGTACACATTGTTTTAGTCACTTATTCCCCACTGACCCCCTAACTTTTCAAATCCGCTCTAAGACAAAAGAAATTGCTGTGCGCGATTTCATCAATTCGGTATTTGAGGGTTTTACGCATGACAAACCGCTATGGACTGGATATTGTGACTGCACCCATAGACGAAGAATTGACCACCGGAAACTAATCGGAAATACGATGTTAGCTATAGAAACAGATGAACATCAACACAAATCTTATAAGAAAATGGACGAAGAAACTCGGTATAACGATTTATTTATGGCTTTTTCAGGGAAATGGATCTACATTCGATTCAACCCGGACAAGTATATTTCCAAGTCAGGAAAAAATAAAAATCCGTGTATTGCCACCCGTCTGGAAAAATTAAAATTGGAAATCGAAAAACAAATAGCACGCGTTGAAAAGGGTGAAAATGTTGAATTAGTAGAGAGGATATACATGTATTACGGAAGATACAATTAACCGACCATTTCAGAATATACAGTATACAGATTGTGGATATAGGATAACAGAAGGACCGAATAAACTGTCCTCGACTCCGAATTGATTCATGTTGGTGTGGGAAAAGTGTCACTTCGCGCTCATTTTTTTCCTCTTCCCTTTTGGTTTTTTACCGGGCCTCTTTTTCGCTGCCATCTTCCAGCGAGTCGCTGGTGTATACCTATCTATTTTAAGATCAATCATCTTGTCTGGTCCGGGCTTGGGCCAACTATACTTAACGACCCGGTTCAATAAATTGCTATTCCTATTGACAATGCGAAATCCTACAGCAGGCGGATCCCAGCTAGACCAGAGATCGCAAGACGATTGTACATGATCCTTATTTCCTTTTGCGTACAAGCCAGTGTCATGCTTACGACTTTTCGGGTTGGCGTGCTCCACAAACCACGCTTTGGCAGTCGTTTTTCTGTATCCACTAACGTCAATTCCCCGTTTGCTCATAAAAAACGTATCCATCGCCACTGGCTCGATTTTACCCTTGGTGACCGTCCACATTTCGTAGTATTTGTCATATCGGTGAGCTTTGCCTGACCCTATTTTTTGTGTTATTTGTACATACTGAAACAAAAACCCAGTTGCGGACGGTGTAACCTTAAACCGTGCCTGAGTCCCTGCACGCCATTTTGTTTGCTTTATTTCTCTATCGACCATCGTTACTGTGAACTTCATGTTGTTTTGAGTAACCCACTATAAGTTTTCTGTGCTGATCAGTCGTCAATTCCGAATCAAAATGTCTGAGCGCAAGGTTCAAGGAAGACCCT
>k141_11312800
TCTTGGAGTTTTTCCTGTTTTACAACAACTGCTTTTTCCTGTTTTTGGAGTTTTTCAACGTCTGCCACTAAAATTGTGTTCTCTGCAATTTTCTGTGCAATTGGCCCTACGGGGGCGGCCGGTGGCGCGGCCCGATTACAGGTCTTGTCATTTTCATTATAGAGGCATTCTTTTTCCGGTACGTGTGCACATTTCTCAGGCTGCTTATCAAGGAGTCGACAATCGTAGTCCGCCCCACCCCCGCGAGACTGCAAGTCCAAGTGGTCCAAGTGGTCCAAGTGGTCCAAGTGTCCCATGTGGTCCTCCCGGTGACGCAGACGGTGCATTTTGTCCATATGGCGCATGTGGCTATCGATTGACCTTCTCCGCTCGGTCAGGTGACTTCTTCGGTCGCGTGTGCGGTGAGAACTTGACGTTGTGTTTCCCATTTTTCACTTACCTAACATTTTTTTTTTATCAATTTTACGTGTTCGACTGCCACATTTAAAATTTGTATTGTGTTGGGCAGTGACACACAGTGAATGAAGTTTCAAACCCGCCTTCTAGTGGCAGTCTTTATTACCGCAAGCTTAACAGTGTGTATGGTTGCGTGGATCGTGTCCGGACGCCATCCTGTGTTGCAAGATTCTTTTTATCCTCCACTGAATCCCCCGCTTCACCTTCCTCGACAGTTCGGCTTGGCTTACGAGCGGACCCCTCAACGTCTCGACCTTGTGGTTGTACCAGCAGACCCGCTGGGACCTCAATACCCTGAAGCGGCCACCTCTTATTTTTACGTATGAACAGACGCATTTTCCATTTTTAAGATTATTTATTCTTTACGAGAAGATCCAAAGAACGCCCCGTACTTTAGTGCGGGGAGGATGCCATAGGCCAAGACGAAACAGCAACACTTCCGGAATACACTCCCCCGTGCACCACGGGAGCACCAAGAAACTTCTGAAAATGTCCCTTGAGATGAAGGGTCAAGCCGTCCAAGATCTCGAGCGGCACGGGCGACCGGGATTTTCCTCGAACCCACAACTTTAACACGGGCGAGTCGTGGGATAACAATACACCTATAAGTAACGAAGCAGTCGCGTTCTTGCTTTCAAGAAACTCAGCGATCCATGTTCGGAGAAGCGGTGTGTGGATACATTTATGTCGCAGAACGCACCAGACGCCCTCCCGGGAGAGCATTTCATCGGGACTCAATTGAAAACGCAATGTGATGCAGTAGGACCCTGACAAAGAGTCTGTTTTTTTCATCATGCGCGACCGCGCAACAAGCAACCATTGGTATTCAATCGTTTGGACTTAGACTTTTGAATTCATTTTCCATTGTTTTATGACATCCTCCCCGCACTAAAGTACGGGGCGTTCTTTGGATTTTCTTGTAAAAAAAAAACGAGTTTCGTCACGCGTAGCCGTCGGTCGGAGAGCAACTCACGGAGCAGGAACCGCCGCGCTGTCAGTTTTTGAACAAAATTTGGATCTATCAACGGCCTTTTTCCCGCCAAGAAACTCAACCCGGAGATCGTACGGCGACAACCATTCATCCCATTTCGAGTGCCAGCCGTTGAAATGTACCCTTATCAAGACAGCGTCCCAATCCGTGTCGACGACGGTGGCCACCTTTTTTTTTTAGTTCTTTTGAGTTTGTTGTGCCGAGGGAGGGAGAAAGATACGTTTGTTTACCAGCCAGTGAAAAACAGTGTCTTCCACATCAAGCTCATCTCCAACTTTCAGTTCGGATCGCCATTTCTCTCTCTTTCTCTTTCTTTCTTTATCTTGGTCGGAGTCCGCGCGCCCTCCCTGCCATCTCTGCGCCAGCGGTTCGTTGACTGCTTTCCCGGCGCGTGCGGGAACAATATTCCGAACAGCGACGTCTACTGCACTTGGGTCCACCAAGGACTGGACTTTGTTTGGAATAGTGGGGTAACTCTTTCCCGATCTTGTCCGCATTACGTAGAGATCGCTGTCAAAACTTTCTGCTGCATTTTGGTCGACCGAGGACCGCG
GCTCGGCAACATTTGCTCGGATGAAGCGTTTATACAGCGAGCTGGCATAGCTTTTTCCCGAACGTGTACGGATCCCGGAGAAATCGTCTTGCCCGTTTAAGAGAGCTCTTTCCTTCTCATTCAAACCTTGCGCGTCGTTGGTCATAGTCACTGAGTTCCGTTCGAAGATGAAAAAAAAAATATGTGTGATTCAATTTTGATGGGTTACGCAGTATTCCTCTCGTTCTTCTCGTTCCTTTCGTGGCAGAGCCAGTTGTCGTGCTTCGTTCGTTTGAAAGAGCTCGGGTCAACCCTCGAAGTCCCCGCCTTTTTTTTATTAATCGGTTAATCGTACAAGTTAATCAATTCTTTTGCGTACCAGGTTTCCGTGTCGCCATTGGGTAACTTCCGGATTATTTTGTTAGGAATCTTTTTCTCGATTAGTTCCTGAAGAGCTAAATCTAACGGACTTCGTCCTCCAAAAGGATCATCCACCAGAGGAACTTGTCCTACACTTATCTGCTCTGCCCTTACACCAAGGACTTTGGCTCTTTCATATCGGGTTAGGTACGGAGGAGTTCGTCGAGTAAATCGGTCTTCGCTCATTGGCAGCCTTTTCCTCGCCCATGCTCTTTTCTTTATCTATTTTTATAACTCTATTCATTTTTTTTGAACCGACGGAACTGTAGCCGGTAAGTTCGAAAAAGTTTCTCCTATACGCTCGGTGCCTTCACCGTGCTAAAGCACGGGGATTTGACTTTTCTTTATCCTTTCCAGCGTTCAGGTTAACGTGCTTTGCATCTGAAAAGGATAAAGAAAGACAATTCAGGGCTGGCAGCGGGGGTGTACCTCCCTGGCTCGCTCTACCTCTTCTTTTTTGCTTTTATGCAAAGCACCTAAGAGTATCGGTATCTTTCCTCGGACCGACTCGAGAGGAGACAGCGCCCAATTCCGCACAAACGGAACAATCCGGGCCACCGGTAGTTCTGTCATCCACTTAAGGGTGCCGGCGGCAATGAAGGAGGTCATTTATACTAGCCTCTCAAAATTTACTTCAGATGGTAGGTTCGACACAATTCTTTTCATTATTTTGTTCTATATAAGGGTATCCGGTTTCCCCAAAAGCTCACCCAGTGAAAAAAGTCTGTAGGACGCATTATTATTTTTGTTTCCTCTTTACGCATTCCAGATGGGTTAAAAAGAAGGGACTCTATGGACATAAGCCTAGTATGGTGGATTCCGAGGGCGAAATCGAGGTCTGGGTGTCCCTTTCCGTTCGGGGGAAGCTCGTGAGCTTCAGTTCGCTCGGTCGATTCCAAGACTCTTTCGGAAGCAAAAAAGAAGTTGCCCCAGCCGAGAGCAAATACTGTCGTGTGGGGGTGGGGAATAAAACGTTTCAATTCCACGACTTAGTGTGCACGGCGTTCCACGGAGAAAAACCATCGGCGGACCACGAGGCTCACCACCTGGACCATAAGCCCGAGAACAACCGGCCCGATAACTTGTGTTGGCTGACCCACGAACGAAACACCCAAGAAAGTCACCGCACCCAAACGCGAAAATCGAGTGGTCCGCAACGAAGCCGACAAATCCTGGGACGGAAGCACAAGTCCACGAAGGAATGGGTCCCGTACGCGAGCATGAAGGCGGCAGCCACGAAACTCAAGTTAGACGTCGGACCCATCAGCGCCGTCGCGAGGGGACGCTGCCGGCAGACGGGCGGGTACGAATTTAAATTCGCCGAACAACCCGACTTACCCAGGGAAGTTTGGAAATCCCTGACGGTGAACACCAAGAAAATACAGGTGAGTTCCCTGGGCCGATACATGGACTCACGAGGATTGAAAAAGTCGCCGGTGCCTAGCCGCTCCGGGTATTGCCGCGTCATGATCAACCGGAAAAATTACTTTGTCCACCGACTGGTGTGTGAGGCGTTCTGGGGCCCTCCCTCGGATTAGAGGTCAACCACAAAGACGGAAACAAATCCAACAATCATTATATGAACTTGGAATGGGTGACGA
GTCGTTACAACATCTTACACAGTTACAGTACCAACAAAAACCGCCGTTCGAGTGCCGGGAAACTGAGCAAACCGGTGTACGGCCGGAAGCACAAAACAGACGACGAATGGGTGGAGTACCCGAGCATGAGGGCGGCAGCGCGACAGTTGGACCTGAAACCAGGCCCAATCTCCGCCGTCACCAAAGGAAAAAGAAAACAAACAGGAGGCTACGAATTCAAGCTGAAACCGCCGGAGGAAATATCGACGCTAAGTCTACCATGTTCTTTTAATATAAAAAACTCCACCCACTCGTCGTTGGTTTCGTGCTCCTCCAATAAACGTTAAAACAAAATGATTGCACCTAACAGACACGC
>k141_10144865
TGCAATACCCGTCTTCCCGTGGCGCAACAGCTTTCCGAAGTCCATACGAGTCTTGAAACCGCCCCAAAGAGCTGAACCCGACGTTTTTACCCCGAACCTGAATCGTGACCCACACTTCGACCTCGACCTCAGGAACCATACCCGAAGGGGGAGTTAACTATATTCTCTTTCGGTGTCTTTAACGCATGTAACGGTGGGAAACGTTCCGGTACCAGGTTCACGACGACGGTATTTGATCTTTTTTTTATTGAAAATGTATTGCACAAACAAAATTTTAAACACCATAAAAAAATGTTTAGAGCATGAGTGCGGCAGCGCGGAAGCTGGACCTACACCAAGGCTCTATTTCCGGGGTCACCAAAGGAAAACAACACCAAACAGGAGGCTACGAATTCAAGCTGAAACCGCAAGCATAAAGGTCTACGTAATTATCTAATTATCTCTATCTAGTTTTGTCTGGAAATCGTACGCCAAAGACGGTTTTTGTCCATCTGATTTTTCTTTTCAACATATTGTCTACGTCATTGGCCGATGTCCCTAACGGGGGACGCGCTTAAGCTCTACTAACCGAGCCCCACCTTTCACCATTGCCTCAATTTCCTCAGAAAGTTCCTCCCGTTTTCGTCGAATTTGCTTTCGGAATTCGTCCTGGAAGTCCTCGAGTTCAATTCCATCTGCCCTAATTTGAATAGTGACCAGGTCGTGGAGTACCTTTCCCATTACTTTTTTGTCCTGGTCGTCAGCGTCGCCCTTCTTTATAATCGTACTCTTGAGGTTCTCCCAATCTTGTTTGAGTAGTTTGTATTCGCTCAGTATCCCCTGAACGGTATCCGCAGAGAAGTAGGAGTAGGAGGATGGCAATTTTGGCGCGACGAAGTTGTCCCAGCACTCCTTCACGGTCTCTTTACAATGCTTGCAGTTCATCGAGAGCCGCTCTAAAAAAGCGTGTATCGCTTCCACAGCTTGGGTTACTGCTTCACAGCATCCCCCTTCCGCTTCCCCCGACCACCAATCTCTCCACCCCCCTCGCGTAACAAACCACGGCGATTCTCTTTCTTCTATCGTTTGCCTCTGGCCGGCCGCAATAGACTTCAGTTCGTCCAACAGATTTGTTTCCGCTTGAGTCAAGTTCTGAGGAGCCTCCCCAATGTAGTTCCAATAAAATTCTTTGTTTCTCAGGGGATAGAGGTTTTGTCGTATTTCCAAAGGACGTGGGTCCACCAAGTGCCCCATTGCAAAAATTTTGCGACAACAAGAACTAAGCGGATAGTTGGACGTATCTGACCAAGAGGGGAGATCCTTTCCCTGGGGCAAGTTGAGCAGGTAAAGAATTTTCAAGAAAAATTTGGTCGCCTCCATATATGAACACCGGCGCGCAGCGCAGTGGTCCCAGTCGTGCTTTCTACCCCAAAAGCCGGTATACTCTTCAACTGTACACGAACATGGTATCTGCTTACCTGCCCGCAAGACGCCGGCTTCGCACTTGGTCAAACATTGCTTACCTTGTTCGGAGACGCCTTTGTCTTTGTCTTTGCCATAGGGTCCTTTTGTATTTCTTGCTTCCTGTAATTGTTTCTCACGTTCCTCTTTGGTTATGACGCGGGAACTGTCCGTTTTTTCTTCAACTGACAGCATGACGATTTACTTCTTCAAAATAAAAAATGTATAAATGAATACCTATTTTTTTTTCCTTTGATGTACACAATTCTTCGATATTTCAAAAGGTCAAGCCCCTTGCTCTCTGGTACCCCGCTGACTGGGATGACCTTCCAGACTGCGGTTCCAACTTCAGCGTGTGCGACGACCCCTCCAGCACAAGAGTGCAAAATAAGTGGTTTTATTAAAAAAAAATGGGAAATTTATTTGTGAAATAAAGAAAATGTCGCGAATCTATTCTTAGTTTTTGCAAAAAAATACAGATTTTATTTTTATTGGTAGAACGGAAATATGGGAAAGTGTATTTCTCTCTCCCACTGGCGAACAGGAGATCGGAAGTGCTT
ACCCCGAACCCCCGAAGCCGAAGAGGCTGAAATTGACGCATGATCGATTTTGAATTTTTTTTTGTTCTGGTGGGCACAAAAAAAATCCTTGTATAAAGAAAACTGAAAGACGAATTCATTTTGCATGAAAACAAGAGCGCAAATCCCGATGAGTAAACAACTTTATCAATCCCCGGAGAGTATCCGAGCCATGAAACAGAGCCAACGAATAGAGGCACTCCAAGAAATCCTATTCGAACATAAACAGAAGATTCCCCAAGGAGATTTCAAAACCGGTCAAGACATTTTGAAAGCACTATTTGACGAAGAAGAACGAAGACGTTTCGAGAACGAGTGCGTTTGGTACAACGTATTTTTCATTCGTCTGGTTCAGGAAGTGGACGTTCATACCGGACCAGGGGCCGGACCAGACGACTGCACCAACTTGGTTATTGACGATTGCTGCGGCCAAACGACTGTCACGCTCACGCCGAAGACGGAGAAAGTTAGTTTAAGATTACATCCTTTGTTAGTGGAGTTAATCAACGAGCGATGTTGCTTCATCAACCAATCTTACCCTGCGGACGGCTCTTCGGTTTACGACCCTGGCTTCTTGGAAATTTTCCTCCAAGAAAAGGGTAGAAACGAACTGCAAGAACTGCTTGACGCTCAGTACCGAGAGACTATCCACCAGATCACGGACAACCTACGAGGACGTACGCGTGTTTTACAAACCCGGGGATGCACCGAGACGGCACTGGGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAAT
>k141_22888658
GGTCAGCAAACTACGGTGCTCGAACTAGAACACAAAGTTATCCTGCATAAAATACTACTGCGACCAAGTATTATCGTCTCGTGAATATTGAGCAAGATAAGACAATAATAGATTCATTGTTTACAAATTCCAGTAATTAATGGTTCCATTTCAGAGTTTTTATTGAATAATTGTTAAATATTCCCCTTCCCCAAAACTTTTTTTTCCTCTATTTTTGGCGTTGTTTTTTTCCTTTTCCCTCTTCCGATGTCGCTACATTATTCGAAAATAAAAGTCACCTCTGGGGTCAAGATTTGCATTCTATTAAGACAGTGTACTAGAAAGCCCGCCGGAAGTAAGGTTAGCTAGATGCGTTTGAGATAGGTCGCTTCCGGGTGGAAAAGCCGAAAACGTATCGGCTGGTTTTCGAATTGCTTTCTTATGAAAGACGACCGGAGGGGACGGCACAACCAGAACTTGGATGCCTTCGTAGGGTAACTCTAAGTCTGGGCCTTCTCTAAATTTATAAAGAGGAACGTTGACTGACCCCCGAGCTACTACCAATTGGATTCCTCCGGCGAGCATATATCCGTGAGAAAGTAGGGTAGGAGCAGCACAAATAATAGCGACACACTCTGCAGGAACTGCTAACGTTACGCCTGTGTCGAAGGTCGTCACCTGGTTGACTTCGTCTTCTTCAACTCCTTCCCGAGAAAGGAGCAGCATCTTGTACAAACCGGTCTGCACATCCTTTGTCGGAGTGACGGCTCCTCCTTTGCGCAAGTGGACGCCTACTCGAGCTGGACGACGAGAAACTTTAGCACGAATGGTGGAATTCGTTAATGCATGCATTTGGATTATCCTTTTCGCATATATCTTTATACCAAGTGCGCAACGCAACCTTCTTCTCATTTTAATCTGTGGGGGTCTCGGACCCGCTCAATTTTCCAACGCCGCTCTCCGACCCGAGTGGTGCATTTTTTTCGTTGATCCTCAAGAAGAAATAAGAAATTTAAAACAAGAGGTTTCCGGGATAAAACGCGAGGGCAGCCGATACGTCTTTCAATAGCGTACTTTTTTTCTCAGTGTTACGGAAATGGCATCCGCACCGTCATCCAAAGACTTCGTTGGTCCGTCCGTCTGGACCATCCTACACAGTTACGCTGCGGCGTATACTCCTCGTCAGCGGCGAAGCTTTACAGCTTTAGTTAAAGAGCTAACGGGCCTGTTTCCTTGCGATACATGTAAAGAAAATTTCAAGAAAAAGCTCGTTTCAATGCCTCTGGAAAACTATCTCGGAAGTAACCATGACCTGTTTTTTTGGACTTACACGATGCATGATCTAGTGAACCAAGCGGATCCGTCCCACCGCAAGACGTCGCCCCCGTACGATCAAGCTAAATTTATATACTTTGGAGCCATGAAAGAAGCGTGCGAAGACTGTTTCGTTGTCCCCTAAAGAGCTGCTTTGACGACTTTCTGTATGCTGCGAAACAAGGTACTTGCATCCAAGATCGGGAAGTCAACAACATGAAACGAGGTTTTCGTAGTTTCTCGTTCGATTCGTCTTTTTTTTATTTAGAGTACATTAGGTCTCCTGAAATGGCAACGGAATTTGTGGTGACTGCAGAGTTGATGGAAAAAATTTCGAAAGACGAACGACGAATCAGTTTAGACGTTACCGGCGCAACACTGGAAATGCCGCTGACTTGGGAAACGGTGTGGCGCTGCTGGGGATTAAGCTTTCCTTCTCTGGAGAAGCTCAGTGCTAGCACGCTCAAAAAACAGGAAACTGTTGACCTGCTTATTGCTTTGTTCCGAGGAGATTTTCCTAGGCTGCAAGTCGTCCGAGCTCCCCTTACGCGTTTCGGGCTCTTGCCGTTGTGGGCCGAAGTTTTGAACCGTCGAGCGTTCTTGGGTTTCCCCTCCCTTCATTTCGAAAGATAGAATGTCAAAAAAGAAAAAAAAAGAGATAAACTTTTTTTTTGTTCATACTGGTACCGTGAAGGCACTAGCTACCT
CGTCTAAAAAGTCTCTCAAATCCCTCGGGGAACCGCGGGACGCAACACAAGCGAGGAAATTTCCGAACTCCTCGTCATTAACGCAACATCCTTCGGGACATTATGTGCTCATCTCCGAAAGTAACTCGCTTTGCAAAACACGTAGCCAACTATAGTGGGTTTTTTTTTTGACGGGTGGAGGAAAGGACCGATGGAGCCAACGCAATGTGAAGAACTGGAAGACATTATCTCGGAAGTGTACTCCTGTAACTTGGATGACCGCGCGTTAGCAAGTTCTTGCGCCTTCACACGAGATTGGTCAGAGGCAGCCTGGAAAGAATTTTGGGACGAAGTCTTGGCCAACCATGACATGGTCGGAGAGGAATATATTTCGTGCTTGAACCAAAATTACTTCGACGATGACTTCATGCACGCATGGCAGTCCGCCCAAAATCAGGAGTGCGCTCCTTGCGACTCTGGATGCGACGAGGAGTTGGCCCAGCACATCGCTCTCGTGAACGGCGAGGACGAACCGCCCCCAGAAGAATTGTGTGCTATTTGGGAGCAGCTGACTCCCGGGGAGTTCCGCCAACAACAAGTCGAATCGGAACCATATTGGAGAGACTTGGATGACATCTGTGTGGGCCCTGGCGATTCCCAGAAATTCTGGCCCCAGGTGCACACCGCCCTCTTGAGGAAATGCCAAAACGGCAACGGCAACGGAAACGGCAACGGCAACGGCTTTCTTTGGATTTTCTCGTAAAAAATTTTTTCTCGGGGATAAGAAGAAATGCAAGACGACTATTGCGCAGAGATGCAAAGCGAGTTACTTTCGGAGATGAGCACAGAATGTCCCGAAGATATCAATAATGACGAGGACTTCGGAAATTTCGTCGCTTGTGTTG
>k141_18538400
CATTCTTTGTTCTTTTTCTGGTACGGGTGCTGTAACGACTTGGCCGCGGTGCAGAGATTGTGGAATTTTCAGTCGCTGTCAGACTGGAGAAAACACAAGGAGTTAGTGAACGCGCATCGCCGGGCTCCGTTTTTGTGCCCATCTACCCCGGGGGACACCGTGATCACACAGAAGATCGCTGCGCTCCAGATGAAGAAGGACCCGAATTTATTATTTCGGTCTACGCCAAGGGATCCCCGCACAAAAGAGAAGACCGTAACGCTAAAGGGGACTAAGCGGTGGTGGACTCTGTTGGATTATTACGACTGGGAGATTTCGAAACTCGACCCCTTAAAAGAAACGAAAGTTCCTCTCGGCCCATTGGAGAAAGAACGCCTGCAGACGTTATCAGAGCTTAGGAGGGAGGCATTCACTACCGGCCCGCGGATATCCAGCCTGCGCCTGACTAAGGACGAAGAGAAAAGATTTCCCAACGCAGCGAAAAAGTTAACTCGAAGACGCAATTTTGAAAACCGAGAACGAAGGGCGAAAACCGGGACTGACAGTTTAATTACTTTCCAAGTCTCCAAGGTTTACAATCAGTATGTAGAAGTTCTAGTGTCGGAAATCGACGGGCTGAGTCGTTTTCTCTTCGATCGTTGCCTAGAGGATTGTACAACAGAAGAGGAAACGATTGACGAGAAACGGCTAATAAATGTGATCTCTCGGTGCAACTTTTCATACCTGGAGTTATCGGATCTTCTATCGGATATATTCCCGGTAGTCGAGTGGATGAAAAAGAACTTCGGGATCGAACTCGGACGACCGGGCTCGGTCGTGTGAGTGGGCATCTTTTTTCCGTTTTTCCGTTTTTCGTTTTTCTTTAGAAAGCATAAGGAACGATGACTACAACAGCCCCCTTCGCAATTGACTTCTCTTCGGGTCACCGGTGCACTTGTATGAACCCGTCAGGTGTTCCGGACGTGGAACACTGTCTCACGGCTATAGTGTTCACCGTTCCGTCCCAGCTGTACAAAGTTCGCTCGTTGGCTGTGTTGTGGAGAATTTGGGAATTGTACAATGGGGGCCAACTAGATGTTCGTCTTCCCGACGGAACTCAGATCTCTTTGCCGGTTGAAAACGGCCCTATCAATCCGCCTGACCACCCGGGTTCCCCGCTTACAGCTGATTTTTCCAAGCTTTACGGGTTGCCCGGGAACTCCTACATGGAACTGGTGCCTGCACTTCAGCTCAAGCAAGGGGACCTTATTTTAGGGTACGCTCCGACGTCATCTTTAGAACGAAATCTCAACGTGGGTTTGCTAGTCCAATGTTCCGGGTTTCTAGACCCCACCGCGCAAACCGTCCAAGACACCGTTCCGGCTCTTTCTGTGCCATTGCCCGTTACTTTTAGACGGGGGAACCTCTAGTTCGAACTTTTAATTTCACGAAGTTCATTTAAAGAATAGCGGTACTATAATTATGTTACGATTGTAGCGGGAGGAGTGTTCAAAAACTTTCATCCGCTGTCGGCAGCCTACTAACTCCGTTACCACCGAGTGGAGAGACAAGCAAAACGAAATTTTTGTGAAGAACACCGGTCTAGACCGGAAGCTTCGCGGATGGTAAAAGACTTAAACTTCTTTTCTTCTACAAAAAAGAAGAAAAACTTAATTTGATTGTTTGTTATTTGTCCAAGAGACTGATTCTGTCGCGGTTTATTTTATTTTGTTTTTTTCAGTTTTTTTCATTGTTTTTATGAAAACTTCCCAAGTTTTTCTTGGGACATTCTTTTATGGCCTTCAGCCTAGATCTAAAAAAATAAAATTTTGTAGTACGTTTTTTGTTTTTCGGGGAAAAACAAAATGTGGGAAATGACTGTAGGTAAGGCCTGAGCGTTGTCATCTTCTAGGAGCTGCTGCGCCGTGTGGTCCACATACAGCACGCCCCCGTGACGACTACAGCGGCGGCTGGCAACCTCCTGATAAGAAGGGTCGTTTTCCGTTAGGTCCCTAATAGTA
CGGTAGTACACTTTTTTTTTATTGTTAATAGAAAAGAACAGGCCAAAAGGTCCTGTCATCTGAATAAAGCGGAATTAGTTAAGCACTTAGTGGCCACCGGGACAGGCAAACGTAACTGGGAACAGTGTAACGGAAACGCTTCTCGATCAAGAATTGGTATTTTTTATTTTTTCCTAATATGAGATAGATTACGCGGAGCAAACGTCGCAAGAGTCAGGTTTAGCTGGTGGGCCTAAGGCATGATTCACCGGATTCGTCAACGCGCTCTGTCGGAGATAATACATCCCCGTTTTCAGGCTATTTTCCCACCCGTAAAAATGGTACGAAGTGATTTGCTGGAAAGTAACGTCGGGCCTGAACCAAGCATTCAAACTCTGGCTTTGATCAACAAAGGCGTTTCTATCAACCGCGAGATCCAGCAAGTGTTTCTGAGGAATTTCGTAGGCCGTTTTATACTTCTTTTTTAGAAAGGTCAACCGAGCCGCCTGTTTCGGATCAGACGGAAGCCGTAAGTTCTGGACACTGCCATTCTGGTTAAGAATCTGGCGACGAACATCGGGCGTCCACATTGATAAAGCGGCTAGATCTCGGACTAGGTGCCGATTTACTAGCACGAACGTACCGCTTAAAACCGAACGAGTAAAAATCATCTGCGTGAACGGTTCAAAAGACTCGTTATTCCGGAGAATCTGTGCAGTGCTTGCCGTGGGCATTAACGCGATCAGCAACGAGTTCCGAAGTCCGGACCTGCGCATTCTTTCTCTCAAGATGTCCCACTCTTCTTCGGAATATTTGGCGGATGTTAAGGCACTGGGGCGTTTCAATAGATCGAAGTGCAATAGCCCTTTCTCGGAAAGACTTCCTGGAAACGACTCGAAGTGCCCGAACTTCTCCGCGAGATTAATGCTTTCTTCAACAGCGGCGTAGTACATCACTTCGAAAATTTGCCCGTTAAGAGTTCTAGCTTCTTCACCTTCCCAGGATAAATCCAATAACGCAAAAGCATCAGCTAGTCCTTGAACGCCGATACCAATCGGACGGTTCTTTAGGTTTGCATATCGAATCTCCGGTATGCCGTTGGGATAGTAAGTACGATCGATTGCTTGGTTCAGATTTTGCACCAGCTCGGCGACGAGGGTTCGTAGAACGTCAAAGTCGAAAACATCTTTTCCTTCTCGGGAGCGAAGGCACTTCGGCAAGCAGACGGCAGCTAAGTTGCACGAGGCTATTTCTTCCTGAGATGAGAACTCCACGATCTCGACACAGAGATTGGAGCACGGAATTGTGCCTAGATGCTGGTGATTGGAAGTTCTGTTGCACGCATCTTTGTAGAGCATGAAGGGCATCCCAGTCTCTTTCTGCGTGATGACGATCTGCTGCCATAACACTCTAGCACTTATTTTTGTTGCATTAGGGAAGTCCCTCTCGTAACGCCGGTAGAGAGATTCGAACTCGGCGCCCCAAACCTTTCCGAGGCCCGGAGCGTTTTTAGGACAAAAGAGCGTCCAAGAGTCGTCGTTTTTTACTCGCTTCATAAACAAGTCCGACACCCACAAGGCTTGGAACAGGTCTCGGGCGCGGAGGTCCTCGGGCCCCGTATTTTTGCGCAACTCCAAAAATTCTTGGATATCGACGTGCCAAGGCGGAAGGTACATAGTACCGCTTCCTTTCCGTCTGCCGCCTTGGTCGACGCTTCGGAGGATCTCTTGTTTTATTTTAAGCCAGTTCACGATTCCTTTGGAGCGTCCGAAATGTCGGATCGTGGAATGCCTAATGTTCGAGTAGTCGCACCCGATTCCGCCTGTGTTCTTGGAGATCACCGCACAATCGTGCCACGATTTTGTTAGTCCGGCCATCGAGTCGTCGATGGTCATGAGAAAACACGAACTCAACTGGGGTCGATCGGTTCCCGCGTTGTAGGCCGTGGGAGACGCGTGAGAGTACATCCCTGTGCTCAAGCGGTCGTACATTTTCTGGATTTGGGGGAGATTCGGGTACCA
AATAAAAACTGCCATCCTAAGGTACATGTATTGCGGCGTTTCAAGATACACCGGGTCCGCCCGATCCGGAACCACCTTTCGGAGGAGGTAAGACTTAAAAAGCGTTGCGAAACCGAAAAGATCAAATTGCAAATCTCTCTCCGGGTGAAGCATCTGCTCCAAAACTTCTTGGTTTCGAGTAACAAATTTTTGATAACCGGGATCGAACATTTCTGGGAATTCGGAGACGATTTCTGCGAAAGAAAATTTTACTTTTTGTTTTAAAGCCCAGATTTGAATACGCCCCGCAAGAAGCGACCAGTCCGGGTGATCTAAATTGAGATCGGCGCAGACCTTAGCTAATTCTTCCGTATAGTCACAGATAGGAACTTGCGCATTTTCTTCCAGAACGCGGTCCAACCTTGTTTGGTCAACCTTAAGACCGGAGGCCAACACAAGCACTTCTGCGTTAGAGATGCGAGTCATCGTTTTACTGGCAAGAGACGGAAGATCGATTATTCTTATTAGGATTCAGTTTCTTAAATGCTCTTCACGGGCTTTGAGAAAAAACTCGCAAAGCGACTTGAGGTCTCCCTTTTGTTTTTTTTTATTTTTCATTTATACAACGTTTCTTAGTTTTCAGGGGAGGGACTGGCATCCTCGTTCTGGGTTCTGCGTGAAGCCCGATCAACAATAAGGAGTATAAAACACTCATCTCCCTTCCAGTGACGACGCCGGGTTCGCAACGGCAGCTCCCTGTTCCCATAATGTGAATTCCCGGTTGTGCATGAAGTACATAAATAAAAAGTGCTCGGTAAGTTAAAAAAATAGGTGGCTCCCAACGGAAAGTATTCGAGCTTTGAAGTACTTTTTACGTTTGCCAGCGAGTCCACAATCAGCGCCAACTGAAACAGCTGTGGACACGTTAAACATGCAAAACGTACATTTGCATAGGTAATCTCTTGAATGGGAACCACCTATTTTTTTAACTTACCGAGCACTTTTTATTTATGTACTTCATGCACAACCGGGAATTCACATTATGGGACGTAGGTACGACGAAAATCCAAAGAGCGC

COBRA_end_joining_pairs.txt:

k141_9211005_L  k141_7723568_Lrc
k141_7723568_Lrc    k141_9211005_L
k141_13152222_L k141_6797120_Lrc
k141_6797120_Lrc    k141_13152222_L
k141_6797120_L  k141_13152222_Lrc
k141_13152222_Lrc   k141_6797120_L
k141_7723568_L  k141_9211005_Lrc
k141_9211005_Lrc    k141_7723568_L
k141_22888658_L k141_10144865_R
k141_10144865_R k141_22888658_L
k141_10144865_R k141_9211005_Rrc
k141_18538400_L k141_22692969_R
k141_18538400_L k141_20768437_R
k141_22692969_R k141_18538400_L
k141_20768437_R k141_18538400_L
k141_20768437_L k141_13152222_R
k141_20768437_L k141_11312800_R
k141_13152222_R k141_20768437_L
k141_11312800_R k141_20768437_L
k141_9211005_R  k141_10144865_Rrc
k141_10144865_Rrc   k141_9211005_R
k141_10144865_Rrc   k141_22888658_Lrc
k141_9211005_Rrc    k141_10144865_R
k141_7723568_R  k141_6797120_Rrc
k141_6797120_Rrc    k141_7723568_R
k141_6797120_R  k141_7723568_Rrc
k141_7723568_Rrc    k141_6797120_R
k141_22888658_Lrc   k141_10144865_Rrc
k141_11312800_Rrc   k141_20768437_Lrc
k141_13152222_Rrc   k141_20768437_Lrc
k141_20768437_Lrc   k141_11312800_Rrc
k141_20768437_Lrc   k141_13152222_Rrc
k141_20768437_Rrc   k141_18538400_Lrc
k141_22692969_Rrc   k141_18538400_Lrc
k141_18538400_Lrc   k141_20768437_Rrc
k141_18538400_Lrc   k141_22692969_Rrc

Thanks!

linxingchen commented 9 months ago

Not this file, but the COBRA_potential_joining_paths.txt file in the intermediate.files folder. You will probably need to send me the files from both runs. I will take a look tomorrow if you can send them soon (here or linkingchan@gmail.com). Thanks.

Hocnonsense commented 9 months ago

Sure. The COBRA_potential_joining_paths.txt of the first run:

k141_10144865_L []
k141_10144865_R ['k141_22888658_L']
k141_7723568_L  []
k141_7723568_R  []
k141_22888658_L []
k141_22888658_R []
k141_13152222_L []
k141_13152222_R []
k141_11312800_L []
k141_11312800_R []
k141_22692969_L []
k141_22692969_R []
k141_20768437_L ['k141_11312800_R']
k141_20768437_R []
k141_9211005_L  []
k141_9211005_R  []
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_18538400_R []
k141_6797120_L  []
k141_6797120_R  []

For the second run:

k141_13152222_L []
k141_13152222_R []
k141_10144865_L []
k141_10144865_R ['k141_22888658_L']
k141_9211005_L  []
k141_9211005_R  []
k141_22692969_L []
k141_22692969_R []
k141_7723568_L  []
k141_7723568_R  []
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_18538400_R []
k141_20768437_L ['k141_11312800_R']
k141_20768437_R []
k141_6797120_L  []
k141_6797120_R  []
k141_22888658_L []
k141_22888658_R []
k141_11312800_L []
k141_11312800_R []

linxingchen commented 9 months ago

This is strange: the results should be the same, given that the potential joining paths are exactly the same in both runs.
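The two listings above contain the same entries in a different line order, so a plain text diff would report differences that are not real. A minimal sketch of an order-independent comparison follows, assuming each line has the form shown above (an end name followed by a Python-style list); `parse_joining_paths` is a hypothetical helper, not part of COBRA:

```python
import ast


def parse_joining_paths(text):
    """Parse lines like "k141_1_R ['k141_2_L']" into {end_name: [targets]}."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once on whitespace: the end name, then the list literal.
        end_name, targets = line.split(None, 1)
        paths[end_name] = ast.literal_eval(targets)
    return paths


# A few entries copied from the two runs above, in different line order.
run1 = """k141_10144865_L []
k141_10144865_R ['k141_22888658_L']
k141_20768437_L ['k141_11312800_R']
k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
"""
run2 = """k141_18538400_L ['k141_20768437_R', 'k141_11312800_R']
k141_20768437_L ['k141_11312800_R']
k141_10144865_L []
k141_10144865_R ['k141_22888658_L']
"""

# Dict equality ignores line order but still compares every entry.
same = parse_joining_paths(run1) == parse_joining_paths(run2)
print(same)
```

Comparing the parsed dictionaries rather than the raw files separates "same content, different order" from a genuine difference in the joining paths.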

Could you please send me the debug file? I guess it could be huge; if so, please send it via email. Thank you.

linxingchen commented 9 months ago

Hi, I checked the debug files you sent me. From what I can see, the sequences M72_2|k141_22888658 and M72_2|k141_10144865 were validly joined up to the last step. Unfortunately, I cannot yet tell what is happening. If possible, could you please perform the first run again (making sure everything is the same)? Thank you.

Hocnonsense commented 8 months ago

Sorry for the late reply. I created a new environment, installed COBRA, and ran the same command again, and all three results are different. Next I'll try to understand what happens while the software is running.

linxingchen commented 8 months ago

Did you install the newest version (it contains some modifications) and so get yet another different result?

Hocnonsense commented 8 months ago

No, I ran the last try three weeks ago.

linxingchen commented 8 months ago

That's so weird. I am still unsure what your input files for -f/--fasta and -q/--query are.

Hocnonsense commented 8 months ago

I'm going to run the Python script line by line, and I have just formatted the code using black.

As you suggested, one should first assemble with default parameters, then map reads to the assembly to get coverage. Next, the raw assembly is passed as --fasta, and the longer sequences from the assembly are selected as queries.

In my practice, I renamed the sequences (adding M72_2| to identify the sample) and kept only sequences of at least 1000 bp as "raw assembly result 1", then mapped reads to this "raw assembly result 1". The query sequences are the contigs of at least 2500 bp from "raw assembly result 1".

linxingchen commented 8 months ago

If in your practice -f/--fasta = "raw assembly result 1", then that is incorrect: -f/--fasta should be all the assembled contigs without length filtering. -q/--query can be contigs of any length, though.
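To illustrate the distinction, here is a minimal sketch that leaves the full assembly untouched for -f/--fasta and derives only the query set by length. The helper names and the toy contigs are hypothetical; the 2500 bp default is simply the threshold mentioned earlier in this thread:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_queries(lines, min_len=2500):
    """Return only the contigs that are at least min_len bp long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_len]


# Toy assembly (hypothetical contigs); a small min_len keeps the demo short.
assembly = [
    ">contig_1", "ACGTACGTAC", "GT",
    ">contig_2", "ACG",
    ">contig_3", "ACGTACGT",
]

all_contigs = list(read_fasta(assembly))       # unfiltered: input for --fasta
queries = select_queries(assembly, min_len=8)  # length-filtered: input for --query
print(len(all_contigs), [h for h, _ in queries])
```

The point of the design is that filtering happens only on the query side, so short contigs remain available in the --fasta input.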

Hocnonsense commented 8 months ago

Sure, from now on I'll keep all contigs for downstream analyses such as COBRA.

What troubles me is that I've run the software three times with exactly the same parameters and got three different results. Is this by design? Or would the outputs be consistent if I used "all the assembled contigs without length filtering"?

Thanks for your reply. Sincerely, hwrn.

linxingchen commented 8 months ago

Of course, COBRA is NOT designed to behave this way. I have to run it myself to see what is going on.

Hocnonsense commented 8 months ago

Hello, I've found that some of the inconsistent behavior may be caused by this code:

https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L868-L892

This code reads a mapped bam file, which can be generated by different mapping software such as bbmap.sh and bwa mem. However, the two tools behave differently when storing read names in the bam file.

Take fq files with these sequences as an example. _1.fq:

@DP8450004631BRL1C001R00500000953/1
CTCGACCTCAACCGGAACCGCATCCATGACTCGCTGGAGAACGCCTCCGGACCAGTGAATGTAGGGTTGTCATACGGCCATCAGCCGACCACGAGTGATGTCCGTGCTGTCAAGGAACTCGGCCATTCCGTGAACCTCGAGGTGCCTGAG
+
GG:@CHFI9HGIIIG>IGIFHIIFIIHIDHHIGIH?DGG:IIH@IIIHIH:GIHIDHHHH6@IIII2FIG+HHIHIIHHIIIIHIIIHIIHIFBI6HGHF;GIHHIHIEGGGFFGEGI7HGCGIHF8H:57D=GHIIFH8FIFHHGHE19

_2.fq.gz:

@DP8450004631BRL1C001R00500000953/2
CCAATGTCCCGTAGCGCTCGACCATCACCACGCCGTCCAGGTCAGCCAGGACCCACACGTCGTCCGCATCCACCGCCCCGAGAAGCAGAGCGTCTACCTCAGGCACCTCGAGGTTCACGGAATGGCCGAGTTCCTTGGCAGCACGGGCAT
+
GGGF5<'GCFFHHEEGF<GDF'7GDGIIC.<:GGHEGFEHE%FHGFF?DHDCEGIGGFFFEF*&CBG7F;FGC9&FFHA9EC>E'D=GEFFGGF&BH90G?GGF'=FIG?2HE@CEICF.EFF>;BHG>G6=*G?HD,AGBD>G*H7FFF

For BGI data, paired reads are identified by the suffix (/1 and /2), which is kept by bbmap.sh but discarded by bwa mem. That is, the sam file generated by bbmap.sh may look like:

FP150000508TLL1C034R03603075335/1   97  k141_17086954   14  42  2X29=1X118= k141_8064633    197 0   CGCCCATGGCCACCAGGGCGACGAAGAGCGCGAAGTAGGACGCACGTGACAAGGTCATGATAAAAGGCACGAAGGCCGTCGAGGCGATCGCGAAGAATGTCCATCTCCAATGATAGGTCGGCGCATAGAGCGCTAGCGCCATGGACAGGC  FFFFEFDFFF>EEEFF.EFFFADFFFBFFFFFFFFFFEFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:3  AM:i:42
FP150000508TLL1C005R02300778018/2   145 k141_17086954   116 3   8=1X7=1X14=1X71=1X46=   k141_8064633    54  0   ATCTCCAACGATAGGTGGGCGCATAGAGCGCCAGCGCCATGGACAGGCAGATAACGATGAGCAGGTAGCCGCCGAGTGTATTGGGCTCCGTGCCCCCTGCCTCAAAGGGCGCACTCACGCGCGGCAGCGTGCCGATACTGATGATCCCGT  EFFFFFFEFFFFFFFFFFFFFFFFFFFFEFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  XT:A:R  NM:i:4  AM:i:3

while that generated by bwa mem may look like:

DP8450004631BRL1C017R00701242685    99  M80_2|k141_8788616  156 40  116S19M15S  =   156 19  ATGAAGTGTGATGATTTACTGTTCCAATAAGGAATATACTCAGGTCGCCCAATAGCGGGATCTCGCAATACCTCGCTATAGTTTGGAGGGTATTCTGGGTCTTTAGTGAAGAATAAGAAATTACCAAAATCTGATTCTATAGTATCATCA  IIGGG@FHFGGHIGIGIIGIHGFGGIIHIHHGGGIIHIGIIHGFFGIGIGIIIIEFGFHGHGIHGHHGHIIDIHBGGHHIHGGHGDFGGHGIHIHIHEGG=IHIHHIGHGAEIIIHFGGGHIIIHIIIIIHGFBHHIIIGIHHGIIIIII  NM:i:0  MD:Z:19 MC:Z:61S19M70S  AS:i:19 XS:i:0
DP8450004631BRL1C017R00701242685    147 M80_2|k141_8788616  156 40  61S19M70S   =   156 -19 CGGGATCTCGCAATACCTCGCTATAGTTTGGAGGGTATTCTGGGTCTTTAGTGAAGAATAAGAAATTACCAAAATCTGATTCTATAGTATCATCAGATTTAGAAAGATTATTAACAATATAATCAATAAAGCCCTCAGCAACTTTCCCAG  HIIIIGHFIIIHGHIIIHHIIIIIIIIHIIHIIIIIHIHIIIIIFIIIIIIHIIHIIIIIIIIHIIIHGIHIIIHIIIIIHIIIIHIIIIIHIIIIIHIIIIGIHIIIIIIIIIIIIIIIIIHIHIIHIIHHIIFIHHIIGIIGGGGHII  NM:i:0  MD:Z:19 MC:Z:116S19M15S AS:i:19 XS:i:0

In this case, https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L891 will never put the two reads of a pair into the same list.

Besides, why should we care about paired reads that are both at the ends of a contig? Is it only for identifying self-circular contigs? And why do we consider only the start of the read and not its end? Is that based on experience?
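One possible workaround is to normalize read names before grouping mates. The sketch below is not the code in cobra.py; the helper names and the contig labels are hypothetical, and only the read name is taken from the fq example above. It strips a trailing /1 or /2 so that both mates share one key; with pysam, the same helper could presumably be applied to each record's `query_name`:

```python
from collections import defaultdict

MATE_SUFFIXES = ("/1", "/2")


def strip_mate_suffix(read_name):
    """Drop a trailing /1 or /2 so both mates share the same name."""
    if read_name.endswith(MATE_SUFFIXES):
        return read_name[:-2]
    return read_name


def group_contigs_by_read(records, normalize=True):
    """Group the contigs hit by each read; records are (read_name, contig)."""
    grouped = defaultdict(list)
    for read_name, contig in records:
        key = strip_mate_suffix(read_name) if normalize else read_name
        grouped[key].append(contig)
    return dict(grouped)


# Two mates of one pair, named as bbmap.sh would keep them.
records = [
    ("DP8450004631BRL1C001R00500000953/1", "contig_A"),
    ("DP8450004631BRL1C001R00500000953/2", "contig_B"),
]

without_fix = group_contigs_by_read(records, normalize=False)
with_fix = group_contigs_by_read(records, normalize=True)
print({len(v) for v in without_fix.values()})  # {1}: mates never meet
print({len(v) for v in with_fix.values()})     # {2}: mates grouped together
```

Names without a suffix, as written by bwa mem, pass through unchanged, so the same grouping would work for both mappers.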

Thanks!

P.S.: I'm trying to refactor cobra.py following suggestions from black and mypy. Would you kindly accept some pull requests that make the code more structured and readable? : )

linxingchen commented 8 months ago

Hi,

Thank you so much for your efforts.

  1. Regarding the header format from different mapping tools, have you tested whether pysam can recognize the "/1" and "/2" and remove them from the query_name?
  2. I do not think this is an "inconsistency" but rather an overlooked input file format, which should have been handled if pysam cannot recognize the suffixes.
  3. Relatedly, do you mean that you used mapping files from different tools for your previous three runs with different outputs?
  4. I am happy to accept any requests as long as they make COBRA better.

Best, LINXING

Hocnonsense commented 8 months ago

Thanks!

  1. Yes, I ran the code block manually and checked line.query_name from map_file and the PE names in contig_spanned_by_PE_reads. When I noticed that {len(PE) for contig in contig_spanned_by_PE_reads for PE in contig} returns {1}, I checked the read names from the same contig and found read names that are nearly identical but have different suffixes.
  2. Based on my test, I think pysam does not distinguish them.
  3. However, I used the same mapping file for all three previous runs; it was generated by bbmap.sh and so may be affected by this problem.
  4. It's my pleasure!

Sincerely, hwrn

linxingchen commented 8 months ago

Could you please share with me the different sam/bam files that you created?

Regarding "why should we care about paired reads that are both at the ends of a contig? Is it only for identifying self-circular contigs? And why do we consider only the start of the read and not its end? Is that based on experience?": it is not only for self-circular contigs. Given the insert length used in library construction, (1) the two paired-end reads should NOT be allowed to span too great a distance, and (2) the paired-end reads must span the ends of the two contigs that are going to be joined. I hope this makes sense to you.
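The two conditions can be sketched as a simple check on where each mate starts. This is only an illustration, not COBRA's implementation: the function names, the 0-based coordinates, and the 500 bp window are all assumptions chosen for the example.

```python
def contig_end(start, contig_len, window=500):
    """Return 'L' or 'R' if a 0-based read start lies within `window` bp of
    that contig end, else None. The window value is an assumption."""
    if start < window:
        return "L"
    if start >= contig_len - window:
        return "R"
    return None


def pair_spans_ends(mate1, mate2, contig_lengths, window=500):
    """True if both mates (contig, start) sit near contig ends, and those
    ends are two different ends (two contigs, or both ends of one contig)."""
    ends = []
    for contig, start in (mate1, mate2):
        end = contig_end(start, contig_lengths[contig], window)
        if end is None:
            return False  # a mate deep inside a contig cannot support a join
        ends.append((contig, end))
    return ends[0] != ends[1]


# Hypothetical contig lengths and mate positions.
lengths = {"contig_A": 5000, "contig_B": 3000}

joined = pair_spans_ends(("contig_A", 4800), ("contig_B", 100), lengths)
interior = pair_spans_ends(("contig_A", 2500), ("contig_B", 100), lengths)
circular = pair_spans_ends(("contig_A", 50), ("contig_A", 4900), lengths)
print(joined, interior, circular)
```

Restricting both mates to a window near the ends reflects condition (1), since a pair cannot be farther apart than the insert length allows, and requiring two different ends reflects condition (2), with the self-circular case as the variant where both ends belong to one contig.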

Hocnonsense commented 8 months ago

Sure. The bam generated using bwa mem was mapped to dereplicated MAGs for another purpose, and the other bam I'll provide is the newly generated one using the whole assembly. However, the bam files are too large (>80G each), so I'll share them by email. Before that, could the two examples be used directly for testing?

For BGI data, paired reads are identified by the suffix (/1 and /2), which is kept by bbmap.sh but discarded by bwa mem. That is, the sam file generated by bbmap.sh may look like:

FP150000508TLL1C034R03603075335/1 97  k141_17086954   14  42  2X29=1X118= k141_8064633    197 0   CGCCCATGGCCACCAGGGCGACGAAGAGCGCGAAGTAGGACGCACGTGACAAGGTCATGATAAAAGGCACGAAGGCCGTCGAGGCGATCGCGAAGAATGTCCATCTCCAATGATAGGTCGGCGCATAGAGCGCTAGCGCCATGGACAGGC  FFFFEFDFFF>EEEFF.EFFFADFFFBFFFFFFFFFFEFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:3  AM:i:42
FP150000508TLL1C005R02300778018/2 145 k141_17086954   116 3   8=1X7=1X14=1X71=1X46=   k141_8064633    54  0   ATCTCCAACGATAGGTGGGCGCATAGAGCGCCAGCGCCATGGACAGGCAGATAACGATGAGCAGGTAGCCGCCGAGTGTATTGGGCTCCGTGCCCCCTGCCTCAAAGGGCGCACTCACGCGCGGCAGCGTGCCGATACTGATGATCCCGT  EFFFFFFEFFFFFFFFFFFFFFFFFFFFEFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  XT:A:R  NM:i:4  AM:i:3

while that generated by bwa mem may look like:

DP8450004631BRL1C017R00701242685  99  M80_2|k141_8788616  156 40  116S19M15S  =   156 19  ATGAAGTGTGATGATTTACTGTTCCAATAAGGAATATACTCAGGTCGCCCAATAGCGGGATCTCGCAATACCTCGCTATAGTTTGGAGGGTATTCTGGGTCTTTAGTGAAGAATAAGAAATTACCAAAATCTGATTCTATAGTATCATCA  IIGGG@FHFGGHIGIGIIGIHGFGGIIHIHHGGGIIHIGIIHGFFGIGIGIIIIEFGFHGHGIHGHHGHIIDIHBGGHHIHGGHGDFGGHGIHIHIHEGG=IHIHHIGHGAEIIIHFGGGHIIIHIIIIIHGFBHHIIIGIHHGIIIIII  NM:i:0  MD:Z:19 MC:Z:61S19M70S  AS:i:19 XS:i:0
DP8450004631BRL1C017R00701242685  147 M80_2|k141_8788616  156 40  61S19M70S   =   156 -19 CGGGATCTCGCAATACCTCGCTATAGTTTGGAGGGTATTCTGGGTCTTTAGTGAAGAATAAGAAATTACCAAAATCTGATTCTATAGTATCATCAGATTTAGAAAGATTATTAACAATATAATCAATAAAGCCCTCAGCAACTTTCCCAG  HIIIIGHFIIIHGHIIIHHIIIIIIIIHIIHIIIIIHIHIIIIIFIIIIIIHIIHIIIIIIIIHIIIHGIHIIIHIIIIIHIIIIHIIIIIHIIIIIHIIIIGIHIIIIIIIIIIIIIIIIIHIHIIHIIHHIIFIHHIIGIIGGGGHII  NM:i:0  MD:Z:19 MC:Z:116S19M15S AS:i:19 XS:i:0

linxingchen commented 8 months ago

Hold on. 80 GB is too large.

Do you know if there are any other sequencers that generate reads with headers like those from BGI?

Hocnonsense commented 8 months ago

I've searched my sequence data, but even the smallest dataset is 26Gb (total size of clean.1.fq.gz and clean.2.fq.gz). The demo data provided by BGI is 100Gb (💯). However, I read the code of bwa mem and found out how bwa treats paired-end reads:

  1. the input fq files are stored in aux.ks and aux.ks2: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/fastmap.c#L376-L392
  2. during processing, the sequences are read by bseq_read: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/fastmap.c#L64-L73C15
  3. in bseq_read, the read number (readno) is trimmed first: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/bwa.c#L79C10-L94
  4. trim_readno just looks at the last two characters and removes them if they are a '/' followed by a digit: https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/bwa.c#L54C20-L58
    static inline void trim_readno(kstring_t *s)
    {
        if (s->l > 2 && s->s[s->l-2] == '/' && isdigit(s->s[s->l-1]))
            s->l -= 2, s->s[s->l] = 0;
    }

This code is from a commit (moved some common code to bwa.{c,h}) 12 years ago with

linxingchen commented 8 months ago

If so, it looks like the /1 and /2 suffixes are not a problem, am I correct?

Hocnonsense commented 8 months ago

I think that /1 and /2 is a normal way to identify paired reads, and the suffix can be removed when necessary. I've added a "--trim_readno" param to handle it.
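
A minimal sketch of what such trimming could look like in Python, mirroring the logic of bwa's trim_readno quoted above. The function name is hypothetical and this is not necessarily how the "--trim_readno" option is implemented in COBRA:

```python
def trim_readno(name: str) -> str:
    """Drop a trailing "/<digit>" read-number suffix, if present."""
    # Same condition as bwa: length > 2, second-to-last char is '/',
    # last char is an ASCII digit.
    if len(name) > 2 and name[-2] == "/" and name[-1] in "0123456789":
        return name[:-2]
    return name


# Read names taken from the bbmap.sh and bwa mem examples above.
print(trim_readno("FP150000508TLL1C034R03603075335/1"))
print(trim_readno("DP8450004631BRL1C017R00701242685"))
```

With both mates reduced to the same name, reads mapped by bbmap.sh and by bwa mem can be paired by name in the same way.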

Hocnonsense commented 8 months ago

Meanwhile, I think that line 277 is useless for detect_self_circular: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L265-L285

Considering link_pair, which collects linkages between header + "_L", header + "_R", header + "_Lrc", and header + "_Rrc": if maxk_length is correctly set to an odd number, seq[:maxk_length] will never be the same as reverse_complement(seq[:maxk_length]), because the middle base is ALWAYS different.
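
The odd-length argument can be checked by brute force for small k. The sketch below is only an illustration; reverse_complement here is a local helper written for this example, not the function from cobra.py:

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def self_rc_kmers(k: int) -> list:
    """All k-mers over ACGT that equal their own reverse complement."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in kmers if kmer == reverse_complement(kmer)]


# Odd k: the middle base would have to be its own complement, which is impossible.
print(len(self_rc_kmers(3)), len(self_rc_kmers(5)))  # 0 0
# Even k: such k-mers exist, e.g. "AT" or "ACGT".
print(len(self_rc_kmers(2)), len(self_rc_kmers(4)))  # 4 16
```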

Hocnonsense commented 8 months ago

Next, I'm puzzled by line 966: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L962-L969

I think that the end_part may sometimes occur in the middle of the sequence. For example, I think this is a valid self_circular contig with an overlap of kmer[2n+1]: kmer[2n+1 - end_part] + kmer[end_part] + any + kmer[end_part] + any2 + kmer[2n+1 - end_part] + kmer[end_part]

linxingchen commented 8 months ago

> Next, I'm puzzled by line 966:
>
> https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L962-L969
>
> I think that the end_part may sometimes occur in the middle of the sequence. For example, I think this is a valid self_circular contig with an overlap of kmer[2n+1]: kmer[2n+1 - end_part] + kmer[end_part] + any + kmer[end_part] + any2 + kmer[2n+1 - end_part] + kmer[end_part]

We should pay attention to those with the end_part in the middle, if they ever exist. I have no idea why this would happen, so it is better to ignore such cases; that's why I set it to == 2.
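
As a rough illustration of the "== 2" idea, and not the actual code at line 966, the sketch below assumes the check counts how many times the end overlap occurs in the contig. The helper names count_occurrences and looks_self_circular are hypothetical:

```python
def count_occurrences(seq: str, sub: str) -> int:
    """Count occurrences of sub in seq, including overlapping ones."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        start = seq.find(sub, start + 1)
    return count


def looks_self_circular(seq: str, overlap_len: int) -> bool:
    """True if the end overlap also starts the contig and appears exactly twice."""
    end_part = seq[-overlap_len:]
    if not seq.startswith(end_part):
        return False
    return count_occurrences(seq, end_part) == 2


# Overlap only at both ends: accepted.
print(looks_self_circular("ACGTA" + "GGCC" + "ACGTA", 5))
# Same overlap also in the middle: ignored by an "== 2" rule.
print(looks_self_circular("ACGTA" + "GG" + "ACGTA" + "CC" + "ACGTA", 5))
```

Under this assumption, a contig with a third copy of the overlap in the middle is skipped rather than called self-circular, which matches the cautious behaviour described above.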

linxingchen commented 8 months ago

> Meanwhile, I think that line 277 is useless for detect_self_circular:
>
> https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L265-L285
>
> Considering link_pair, which collects linkages between header + "_L", header + "_R", header + "_Lrc", and header + "_Rrc": if maxk_length is correctly set to an odd number, seq[:maxk_length] will never be the same as reverse_complement(seq[:maxk_length]), because the middle base is ALWAYS different.

You are right, but (1) will this step take long? and (2) might someone use even kmer sizes?

Also, this line is meant to exclude some other abnormal cases of false-positive "self-circular" contigs; unfortunately, I cannot recall what they are. Otherwise, I could simply use if contig + '_R' in link_pair[end] for both one_path_end and two_paths_end, right?

Hocnonsense commented 8 months ago

> We should pay attention to those with the end_part in the middle, if they ever exist. I have no idea why this would happen, so it is better to ignore such cases; that's why I set it to == 2.

In my test data, there is no such case among 172617 orphan_end_query sequences. However, I think it can happen, for example, in a sequence that contains CRISPR spacers, where the circular sequence breaks near a spacer.

> Also, this line is meant to exclude some other abnormal cases of false-positive "self-circular" contigs; unfortunately, I cannot recall what they are.

Thanks for your explanation! I'm just curious about the conditions under which this line takes effect. Besides, I think assemblers rarely allow even kmer lengths, to avoid a kmer being equal to its own reverse complement.

linxingchen commented 8 months ago

  1. xxxx ---- xxxx ---- xxxx (xxxx = the length of maxK for metaspades and megahit or maxK-1 for idba_ud) should not exist in my opinion.

  2. it is not for even kmer but something else that I can't recall for now.

Hocnonsense commented 8 months ago

> • xxxx ---- xxxx ---- xxxx (xxxx = the length of maxK for metaspades and megahit or maxK-1 for idba_ud) should not exist in my opinion.
> • it is not for even kmer but something else that I can't recall for now.

  1. By your definition, I agree that the pattern xxxx ---- xxxx ---- xxxx (len(xxxx) = maxK) will never form in an assembly. What concerns me is the pattern yy xxx ---- xxx ---- yy xxx (len(xxx) = minK) when detecting self_circular with an expected overlap that is not long enough.
  2. OK, I've recorded this risk and will keep an eye on it.

Thanks for your kind explanations!

linxingchen commented 8 months ago

I agree with you on 1. That makes sense; we should modify this. Thanks.

Hocnonsense commented 8 months ago

Hi, I've found that the loop in [09/23] can be sped up significantly by changing https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L1000-L1001 to:

    for contig in tqdm(
        query_set - (orphan_end_query | self_circular),
        desc="Detecting joins of contigs. ",
    ):

For example, in my data, len(query_set)=339757, len(orphan_end_query)=172917, len(self_circular)=181; the old version takes more than 30 min to finish, while the new version takes only 3 seconds.
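
The difference between the two forms can be sketched on toy data. The sets below are made up and much smaller than the real ones, and slow_filter / fast_filter are hypothetical names; the point is only that the list-based test rebuilds and scans two lists for every contig, while the set difference is computed once:

```python
import time

# Toy stand-ins for the real sets (sizes are illustrative only).
query_set = {f"k141_{i}" for i in range(2000)}
orphan_end_query = {f"k141_{i}" for i in range(0, 2000, 2)}
self_circular = {f"k141_{i}" for i in range(1, 2000, 100)}


def slow_filter() -> set:
    """Per-contig membership test against freshly built lists."""
    return {
        contig
        for contig in query_set
        if contig not in list(orphan_end_query) + list(self_circular)
    }


def fast_filter() -> set:
    """Single set difference, using hash lookups."""
    return query_set - (orphan_end_query | self_circular)


for fn in (slow_filter, fast_filter):
    start = time.perf_counter()
    kept = fn()
    print(fn.__name__, len(kept), f"{time.perf_counter() - start:.4f} s")
```

Both functions return the same contigs; only the amount of repeated work differs, and the gap grows with the size of the sets.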


Meanwhile, I noticed that an upper bound on the target contig is set when trying to append it to contig2join (https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L363), seemingly to avoid adding a target contig with too high coverage.

I also noticed that this restriction only appears in the two_paths_end situation, for all added contigs except the first one, so I'm also curious why we don't worry about a first contig with much higher abundance being joined with the contig?

linxingchen commented 8 months ago

Regarding [09/23], did you only modify lines 1000 and 1001? If yes, I do not think this modification is the main reason for the time reduction. I am also wondering whether the time you monitored is exactly for the whole step.

Please explain why if you disagree.

I modified

    for contig in query_set:
        if contig not in list(orphan_end_query) + list(self_circular):

to

    for contig in query_set - (orphan_end_query | self_circular):

and checked one sample, and found that the time was reduced from ~5 mins to ~4 mins. I will check more samples though.

linxingchen commented 8 months ago

And line 363,

    if cov[contig_name(target)] >= 1.9 * cov[contig]:

is there to exclude the case of a repeat region longer than the max kmer, for example, transposase genes (xxxx below).

    -------------xxxx-------------xxxx---------------xxxx------------------
    ------1------.......-------2-----.......-------3------........-------4----------

If region 1 is the query, the join could be 1+2, 1+3, or 1+4, which cannot be determined.
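
A small sketch of how such a coverage test behaves, using made-up coverage values. The helper is_probable_repeat and the contig names are hypothetical; in cobra.py the target is first mapped to its contig name with contig_name(target):

```python
def is_probable_repeat(cov: dict, query: str, target: str, ratio: float = 1.9) -> bool:
    """True when the target's coverage is at least `ratio` times the query's."""
    return cov[target] >= ratio * cov[query]


# Made-up coverages: a repeat shared by three loci collects roughly 3x coverage.
cov = {"region_1": 20.0, "region_2": 21.5, "repeat_xxxx": 61.0}

print(is_probable_repeat(cov, "region_1", "region_2"))     # similar coverage
print(is_probable_repeat(cov, "region_1", "repeat_xxxx"))  # much higher coverage
```

A target that fails this test has coverage close to the query's, so it is less likely to be a collapsed repeat shared by several genomic locations.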

Hocnonsense commented 8 months ago

> and checked one sample, and found that the time was reduced from ~5 mins to ~4 mins. I will check more samples though.

Oh, I tried a toy example on my local machine, and also could not reproduce this problem:

    from tqdm import tqdm

    a = {"K141_{i}" for i in range(3000000)}
    b = {"K141_{i}" for i in range(1500000)}

    for contig in tqdm(a):
        if contig not in list(b):
            pass

However, I can still reproduce it using the dataset I'm testing on. It is the repeated conversion of orphan_end_query to a list() that costs a lot of time (probably a problem related to memory allocation and usage)...
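
One possible reason the toy example finishes instantly: the strings in it have no f prefix, so "K141_{i}" is the same literal on every iteration and each set comprehension yields a single element. A small check (with a reduced size) shows the difference:

```python
n = 1000

# Without the f prefix the literal never changes, so the set collapses.
without_f = {"K141_{i}" for i in range(n)}
# With the f prefix each iteration produces a distinct name.
with_f = {f"K141_{i}" for i in range(n)}

print(len(without_f), len(with_f))  # 1 1000
```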


Thanks for your kind explanation! I've got it!

linxingchen commented 8 months ago

Sorry, what problem are you talking about that you could not reproduce?

Hocnonsense commented 8 months ago

The extremely long time it takes to repeatedly generate a list from orphan_end_query.

>>> timeit("list(orphan_end_query)", number=1000, globals=globals())
3.173712281975895
>>> len(orphan_end_query)
172917
>>> timeit("list(orphan_end_query)", number=1000, globals={"orphan_end_query": {f'k141_{i}' for i in range(172917)}})
3.205506692000199

This means that only about 300 entries can be processed per second, so 1/(1000/3.2) * 339757 / 60 ≈ 18 min will be wasted on it...
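
The estimate can be reproduced step by step. The figures are the ones reported above (about 3.2 s per 1000 calls and 339757 query contigs); the variable names are only for illustration:

```python
seconds_per_1000_calls = 3.2  # timeit result for 1000 x list(orphan_end_query)
n_query = 339757              # len(query_set), one list() call per contig

seconds_per_call = seconds_per_1000_calls / 1000
calls_per_second = 1 / seconds_per_call
total_minutes = seconds_per_call * n_query / 60

print(f"{calls_per_second:.1f} calls per second")
print(f"{total_minutes:.1f} minutes spent on list() conversions alone")
```

Note that the original loop also builds list(self_circular) and concatenates the two lists for every contig, so this figure is a lower bound on the overhead.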

I've also tested it on my computer:

>>> timeit("list(orphan_end_query)", number=1000, globals={"orphan_end_query": {f'k141_{i}' for i in range(172917)}})
5.158639047993347