linxingchen / cobra

A tool to raise the quality of viral genomes assembled from short-read metagenomes via resolving and joining of contigs fragmented during de novo assembly.
MIT License

Questions on input fasta files #15

Open Hocnonsense opened 6 months ago

Hocnonsense commented 6 months ago

Thanks for this great tool! However, I have a few questions about the input for cobra:

  1. I'm currently using megahit for assembly (I learned from #3 that it may generate many chimeric contigs, but spades can run out of memory when assembling environmental samples). Among megahit's intermediate results there is a file named ./intermediate_contigs/k141.contigs.fa, which contains many small contigs < 200 bp (these are filtered out of ./final.contigs.fa). My question is: which contig file is recommended as the --fasta FASTA input?
  2. Is it possible to simply use final.contigs.fa as the --query QUERY file, or a filtered version containing only contigs longer than a given length (e.g. 1000 or 2500 bp; a possible filtering sketch is shown below)? In other words, can cobra be used before viral contig annotation and MAG binning?
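For reference, a minimal sketch of the kind of length filter mentioned in question 2; this is not part of cobra, and the file names and the 1000 bp cutoff are only placeholders:

    # Minimal sketch (not part of cobra) of a simple length filter: keep only
    # contigs >= min_len from a fasta file. File names and the 1000 bp cutoff
    # are placeholders.
    def filter_fasta_by_length(in_path, out_path, min_len=1000):
        with open(in_path) as fin, open(out_path, "w") as fout:
            header, seq = None, []

            def flush():
                # write the previous record if it passes the length cutoff
                if header is not None and sum(len(s) for s in seq) >= min_len:
                    fout.write(header + "\n" + "".join(seq) + "\n")

            for line in fin:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    flush()
                    header, seq = line, []
                else:
                    seq.append(line)
            flush()  # do not forget the last record

    filter_fasta_by_length("final.contigs.fa", "final.contigs.min1000.fa")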

Regards, hwrn

linxingchen commented 5 months ago

great. so we should modify from

for contig in query_set:
    if contig not in list(orphan_end_query) + list(self_circular)

to

for contig in query_set - (orphan_end_query | self_circular):

right?
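As a minimal self-contained sketch (with made-up contig names) of why this helps: the original form rebuilds the concatenated list and scans it linearly for every contig, while the proposed form computes one set union and one set difference up front; both keep the same contigs.

    # Made-up contig names, only to illustrate that both forms keep the same
    # contigs while the set version avoids rebuilding and scanning a list on
    # every iteration.
    query_set = {"k141_1", "k141_2", "k141_3", "k141_4"}
    orphan_end_query = {"k141_2"}
    self_circular = {"k141_4"}

    # original form: the concatenated list is rebuilt and scanned per contig
    kept_original = [
        contig
        for contig in query_set
        if contig not in list(orphan_end_query) + list(self_circular)
    ]

    # proposed form: one set union and one set difference, computed once
    kept_proposed = query_set - (orphan_end_query | self_circular)

    assert set(kept_original) == kept_proposed == {"k141_1", "k141_3"}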

linxingchen commented 5 months ago

Regarding [09/23], did you only modify lines 1000 and 1001? If so, I do not think this modification is the main reason for the time reduction. And I am wondering whether the time you monitored was exactly for the whole step.

Please explain why if you disagree.

I modified

for contig in query_set:
    if contig not in list(orphan_end_query) + list(self_circular)

to

for contig in query_set - (orphan_end_query | self_circular):

and checked one sample; the time was reduced from ~5 mins to ~4 mins. I will check more samples though.

I tested another sample, and this time the modified version took even longer than the original one for step [09/23] (~14 mins vs ~12 mins). I think the time consumed is not just for checking the list; most of it goes to the join step...

Hocnonsense commented 5 months ago

Probably... May I know how many contigs are in query_set, orphan_end_query, and self_circular in your sample?

In my sample, out of 339757 contigs in contig2assembly, 214342 contain only themselves, i.e. they are not linked to any other contig:

>>> from collections import Counter
>>> Counter(len(i) for i in contig2assembly.values())
Counter({1: 214342, 2: 54686, 3: 27750, 4: 15735, 5: 9361, 6: 5882, 7: 3847, 8: 2466, 9: 1608, 10: 1158, 11: 791, 12: 458, 13: 456, 14: 315, 15: 226, 16: 139, 17: 101, 18: 99, 19: 98, 20: 68, 21: 39, 23: 34, 25: 23, 22: 20, 24: 15, 29: 9, 26: 6, 28: 6, 35: 5, 31: 4, 32: 3, 33: 3, 30: 2, 38: 1, 27: 1})
>>> len(contig2assembly)
339757                                                                                                          

I think my modification may improve performance for samples with many contigs but few linkages between them (such as environmental samples with low biomass).

Hocnonsense commented 5 months ago

Now I'm puzzled by a question and need your help. I'm trying to understand the for-loop here, and I'm stuck at this point: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L1106-L1109

As I understand it, this code is meant to compare all query contigs and check whether one assembly is a subset of another. However, this makes the computational complexity $O(n^2k)$, where $n$ is len(contig2assembly) and $k$ is the average number of assembly contigs per query. As the query set grows, this becomes unacceptable (I have now processed 33069 of 339757 contigs in my sample at 8.12 it/s; that took 1 hour, and I expect to wait another 10.5 hours).

(first) I want to change the code to:

    for contig_1 in contig2assembly.keys():
        for contig in contig2assembly.keys():
            if contig != contig_1 and contig2assembly[contig].issubset(contig2assembly[contig_1]):
                if contig2assembly[contig] != contig2assembly[contig_1]:

now all contig pairs will be compared as well

(next) I want to change the code to:

    for contig_1 in contig2assembly.keys():
        for contig in contig2assembly[contig_1] & contig2assembly.keys():
            if contig != contig_1 and contig2assembly[contig].issubset(contig2assembly[contig_1]):
                if contig2assembly[contig] != contig2assembly[contig_1]:

Here, since a contig itself must be in contig2assembly[contig], for any contig not in contig2assembly[contig_1] the test contig2assembly[contig].issubset(contig2assembly[contig_1]) will always be False, and we can reduce the computational complexity to $O(nk^2)$. However, k can be significantly higher than n in some situations (you probably have some examples); in that case the following may be more useful:

    for contig_1 in contig2assembly.keys():
        # iterate over whichever is smaller: the members of this assembly
        # (restricted to contigs that are themselves queries) or all queries
        if len(contig2assembly[contig_1]) < len(contig2assembly):
            potential_sub_contigs = (i for i in contig2assembly[contig_1] if i in contig2assembly)
        else:
            potential_sub_contigs = contig2assembly
        for contig in potential_sub_contigs:
            if contig != contig_1 and contig2assembly[contig].issubset(contig2assembly[contig_1]):
                if contig2assembly[contig] != contig2assembly[contig_1]:

The computational complexity here is $O(nk\min(n, k))$, which is acceptable in most cases, at the cost of a slightly more complex code structure...
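To sanity-check the pruning idea behind the last two versions (only members of contig2assembly[contig_1] can be subsets of it, because every query contains itself), here is a small self-contained sketch with made-up data; the naive all-pairs scan and the pruned scan should report the same proper-subset pairs:

    # Made-up toy data, not from a real assembly.
    contig2assembly = {
        "k141_1": {"k141_1", "k141_2"},
        "k141_2": {"k141_1", "k141_2", "k141_3"},
        "k141_3": {"k141_3"},
        "k141_4": {"k141_1", "k141_2", "k141_3", "k141_4"},
    }

    def subset_pairs_naive(c2a):
        # O(n^2 k): compare every pair of queries
        return {
            (contig, contig_1)
            for contig_1 in c2a
            for contig in c2a
            if contig != contig_1 and c2a[contig] < c2a[contig_1]  # proper subset
        }

    def subset_pairs_pruned(c2a):
        # O(n k^2): only queries that are members of c2a[contig_1] can be
        # subsets of it, because every query contains itself
        return {
            (contig, contig_1)
            for contig_1 in c2a
            for contig in c2a.keys() & c2a[contig_1]
            if contig != contig_1 and c2a[contig] < c2a[contig_1]
        }

    assert subset_pairs_naive(contig2assembly) == subset_pairs_pruned(contig2assembly)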

Would you mind testing the last two modifications on your samples? Thanks for your kind help!

Sincerely, hwrn

linxingchen commented 5 months ago

This will only check the contigs that could be extended; do you have 339757 extended contigs in one sample?

Hocnonsense commented 5 months ago

Yes. Maybe I should check whether something went wrong earlier.

>>> from collections import Counter
>>> Counter(len(i) for i in contig2assembly.values())
Counter({1: 214342, 2: 54686, 3: 27750, 4: 15735, 5: 9361, 6: 5882, 7: 3847, 8: 2466, 9: 1608, 10: 1158, 11: 791, 12: 458, 13: 456, 14: 315, 15: 226, 16: 139, 17: 101, 18: 99, 19: 98, 20: 68, 21: 39, 23: 34, 25: 23, 22: 20, 24: 15, 29: 9, 26: 6, 28: 6, 35: 5, 31: 4, 32: 3, 33: 3, 30: 2, 38: 1, 27: 1})
>>> len(contig2assembly)
339757

linxingchen commented 5 months ago

OK. I will test your code on my sample(s) later today. Thanks.

linxingchen commented 5 months ago

Probably... May I know how many contigs are in query_set, orphan_end_query, and self_circular in your sample?

For the second sample I tested: query_set = 7532, orphan_end_query = 2744, self_circular = 354.

Hocnonsense commented 5 months ago

This will only check the contigs that could be extended; do you have 339757 extended contigs in one sample?

Thanks for your kind suggestion! I do detect unextendable contigs in contig2assembly

>>> sum(1 for i, v in contig2assembly.items() if v == {i})
214342

And here is how these unextendable contigs get added to contig2assembly: first, all query contigs are considered in contig2join: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L987-L994 but only a fraction of them are extendable:

>>> len({contig_name(i) for i in contig2join})
339757
>>> len({contig_name(i) for i, j in contig2join.items() if len(j)})
125415

Next, contig2assembly is generated from contig2join: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L1063-L1093 In all branches, each contig in contig2join is recorded in contig2assembly via: https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L1085-L1087 and that is why many contigs in contig2assembly end up with contig2assembly[contig] == {contig}.


To ignore these unused contigs, I think it is necessary to check whether len(contig2join[item]) == 0 before putting item into contig2assembly at L1065:

    contig2assembly: dict[str, set[str]] = {}
    for item in contig2join:
        # skip query ends that were never joined to anything, so that
        # unextendable contigs never enter contig2assembly
        if len(contig2join[item]) == 0:
            continue
        contig = contig_name(item)

Hocnonsense commented 5 months ago

The for-loop starting at line 1106 will give inconsistent results depending on the order in which contig is taken from contig2assembly; I'll show this with a function:

def update_path_circular(reverse=False):
    contig2assembly = {
        "k141_24639557": {"k141_3764354", "k141_6610253", "k141_5597650", "k141_24639557"},
        "k141_3764354": {"k141_3764354", "k141_6610253", "k141_5597650", "k141_24639557"},
        "k141_5597650": {"k141_3764354", "k141_6610253", "k141_5597650", "k141_24639557"},
        "k141_6610253": {
            "k141_3764354",
            "k141_6610253",
            "k141_3763001",
            "k141_5597650",
            "k141_24639557",
        },
    }
    path_circular = {"k141_24639557", "k141_3764354", "k141_5597650"}
    failed_join_list = []
    redundant = set()
    is_subset_of = {}
    is_same_as = {}
    for contig in sorted(contig2assembly, reverse=reverse):

https://github.com/linxingchen/cobra/blob/6536dfc49792d9c602d6bc39ed983d36cc169bd1/cobra.py#L1107-L1152

        print(f"{contig2assembly=}")
        print(f"{path_circular=}")
        print(f"{failed_join_list=}")
        print(f"{redundant=}")
        print(f"{is_subset_of=}")
        print(f"{is_same_as=}")

And let's test it:

>>> update_path_circular(reverse=False)
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354', 'k141_3763001'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_24639557'}
is_same_as={'k141_24639557': {'k141_5597650', 'k141_3764354'}}
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_24639557'}
is_same_as={'k141_24639557': {'k141_5597650', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557'}}
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_24639557'}
is_same_as={'k141_24639557': {'k141_5597650', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557'}, 'k141_5597650': {'k141_6610253', 'k141_24639557', 'k141_3764354'}}
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_24639557'}
is_same_as={'k141_24639557': {'k141_5597650', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557'}, 'k141_5597650': {'k141_6610253', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_5597650', 'k141_24639557', 'k141_3764354'}}
>>> update_path_circular(reverse=True)
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354', 'k141_3763001'}}
path_circular={'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant=set()
is_subset_of={}
is_same_as={}
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354', 'k141_3763001'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_5597650'}
is_same_as={'k141_5597650': {'k141_24639557', 'k141_3764354'}}
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_5597650'}
is_same_as={'k141_5597650': {'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557'}}
contig2assembly={'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_5597650': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}, 'k141_6610253': {'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}}
path_circular={'k141_6610253', 'k141_5597650', 'k141_24639557', 'k141_3764354'}
failed_join_list=[]
redundant={'k141_6610253'}
is_subset_of={'k141_6610253': 'k141_5597650'}
is_same_as={'k141_5597650': {'k141_24639557', 'k141_3764354'}, 'k141_3764354': {'k141_6610253', 'k141_5597650', 'k141_24639557'}, 'k141_24639557': {'k141_6610253', 'k141_5597650', 'k141_3764354'}}

"k141_6610253" is added to is_same_as with sorted but not with sorted(reverse=True)

Since contig2assembly and path_circular can be changed during iteration, I'm worried that the results may differ between runs.

Let's take another example:

def update_path_circular_1(reverse=False):
    path_circular = {"k141_1", "k141_2", "k141_4"}
    contig2assembly = {
        "k141_1": {"k141_1", "k141_2"},
        "k141_2": {"k141_1", "k141_2"},
        "k141_3": {"k141_1", "k141_2", "k141_3"},
        "k141_4": {"k141_1", "k141_2", "k141_3", "k141_4"},
    }

and run it with different iteration orders:

>>> update_path_circular_1()
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2', 'k141_3'}, 'k141_4': {'k141_1', 'k141_4', 'k141_2', 'k141_3'}}
path_circular={'k141_1', 'k141_4', 'k141_2'}
failed_join_list=['k141_1', 'k141_3']
redundant={'k141_1'}
is_subset_of={'k141_1': 'k141_4'}
is_same_as={'k141_1': {'k141_2'}}
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2', 'k141_3'}, 'k141_4': {'k141_1', 'k141_4', 'k141_2', 'k141_3'}}
path_circular={'k141_1', 'k141_4', 'k141_2'}
failed_join_list=['k141_1', 'k141_3', 'k141_2', 'k141_3']
redundant={'k141_1', 'k141_2'}
is_subset_of={'k141_1': 'k141_4', 'k141_2': 'k141_4'}
is_same_as={'k141_1': {'k141_2'}, 'k141_2': {'k141_1'}}
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2', 'k141_3'}, 'k141_4': {'k141_1', 'k141_4', 'k141_2', 'k141_3'}}
path_circular={'k141_1', 'k141_3', 'k141_4', 'k141_2'}
failed_join_list=['k141_1', 'k141_3', 'k141_2', 'k141_3']
redundant={'k141_1', 'k141_3', 'k141_2'}
is_subset_of={'k141_1': 'k141_4', 'k141_2': 'k141_4', 'k141_3': 'k141_4'}
is_same_as={'k141_1': {'k141_2'}, 'k141_2': {'k141_1'}}
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2', 'k141_3'}, 'k141_4': {'k141_1', 'k141_4', 'k141_2', 'k141_3'}}
path_circular={'k141_1', 'k141_3', 'k141_4', 'k141_2'}
failed_join_list=['k141_1', 'k141_3', 'k141_2', 'k141_3']
redundant={'k141_1', 'k141_3', 'k141_2'}
is_subset_of={'k141_1': 'k141_4', 'k141_2': 'k141_4', 'k141_3': 'k141_4'}
is_same_as={'k141_1': {'k141_2'}, 'k141_2': {'k141_1'}}
>>> update_path_circular_1(reverse=True)
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2', 'k141_3'}, 'k141_4': {'k141_1', 'k141_4', 'k141_2', 'k141_3'}}
path_circular={'k141_1', 'k141_4', 'k141_2'}
failed_join_list=[]
redundant=set()
is_subset_of={}
is_same_as={}
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2', 'k141_3'}, 'k141_4': {'k141_1', 'k141_4', 'k141_2', 'k141_3'}}
path_circular={'k141_1', 'k141_3', 'k141_4', 'k141_2'}
failed_join_list=[]
redundant={'k141_3'}
is_subset_of={'k141_3': 'k141_4'}
is_same_as={}
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2'}, 'k141_4': {'k141_1', 'k141_2'}}
path_circular={'k141_1', 'k141_3', 'k141_4', 'k141_2'}
failed_join_list=[]
redundant={'k141_3'}
is_subset_of={'k141_3': 'k141_4'}
is_same_as={'k141_2': {'k141_1', 'k141_4', 'k141_3'}}
contig2assembly={'k141_1': {'k141_1', 'k141_2'}, 'k141_2': {'k141_1', 'k141_2'}, 'k141_3': {'k141_1', 'k141_2'}, 'k141_4': {'k141_1', 'k141_2'}}
path_circular={'k141_1', 'k141_3', 'k141_4', 'k141_2'}
failed_join_list=[]
redundant={'k141_3'}
is_subset_of={'k141_3': 'k141_4'}
is_same_as={'k141_2': {'k141_1', 'k141_4', 'k141_3'}, 'k141_1': {'k141_4', 'k141_3', 'k141_2'}}

These ambiguous assembly results should be combined and clarified in a way that is independent of iteration order.
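For illustration only, one possible order-independent way to group queries whose assembly sets are identical (a rough sketch, not a proposed patch for cobra; using the smallest contig name as the representative is an arbitrary but deterministic choice):

    from collections import defaultdict

    def group_identical_assemblies(contig2assembly):
        # key on the frozen assembly set, so the grouping does not depend on
        # dict or iteration order
        groups = defaultdict(set)
        for contig, assembly in contig2assembly.items():
            groups[frozenset(assembly)].add(contig)
        # deterministic representative: the lexicographically smallest member
        return {min(members): members for members in groups.values()}

    # with part of the second example above, k141_1 and k141_2 collapse into
    # one group represented by k141_1, and k141_3 stays on its own
    group_identical_assemblies({
        "k141_1": {"k141_1", "k141_2"},
        "k141_2": {"k141_1", "k141_2"},
        "k141_3": {"k141_1", "k141_2", "k141_3"},
    })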

linxingchen commented 5 months ago

Yes, I found this issue as well. We should discuss more via WeChat; add me (linking_mumu) if you do not mind.