ukaraoz / microtrait-hmm

Profile HMMs for MicroTrait

[Question] Discrepancy between traits on GitHub, supplementary tables, and hierarchy confusion? #5

Open jolespin opened 5 months ago

jolespin commented 5 months ago

I'm trying to understand how microtrait-hmm traits are organized and how to work my way up the hierarchy.

The trait HMMs are here: https://github.com/ukaraoz/microtrait-hmm/blob/master/data

Now separately, we have the Supplementary Tables from your manuscript (attached):

Are some traits not associated with any rules?

Can you describe exactly what is meant by binary, count, and count_by_substrate in the microtrait_rule-type field of Table S5? I read the paper's Methods, but this part wasn't clear to me. I'm assuming binary is presence/absence, but what is meant by count and count_by_substrate?

Why are traits missing fields for microtrait_trait-name1, microtrait_trait-name2, and microtrait_trait-name3? I'm confused about how traits and rules are used here. Are the traits hierarchical, or the rules? For example, there are 975 rules here, but 628 of them are missing fields for those columns.

Very interested in using this but I need more information before I can use this on our dataset.

Table 1.XLSX

ukaraoz commented 5 months ago

Hi, thank you for your interest. The microtrait-hmm database is separate from microtrait, and not all the HMMs in it contribute to the final set of rules and traits in production. Similarly, not all the rules contribute to the final trait hierarchy; some are simply not part of the production hierarchy. The dbxref table is for those HMMs that had a cross reference to a particular external database.

A binary trait is presence/absence; a count trait is used otherwise, when the trait reflects a certain degree of genome investment in the trait in question. There is no reference to count_by_substrate in the paper; it is a helper function to summarize traits, primarily for substrate acquisition.

Traits are hierarchical, rules are not. The hierarchy has three levels, with traits summarized at each level. Hope that helps.

jolespin commented 5 months ago

Thanks for your help!

> The microtrait-hmm database is separate from microtrait, and not all the HMMs in it contribute to the final set of rules and traits in production. Similarly, not all the rules contribute to the final trait hierarchy; some are simply not part of the production hierarchy.

This is good to know. Are there any categorizations within the HMM sets in microtrait-hmm? For example, HMM set [x, y, and z] is associated with function A?

> The dbxref table is for those HMMs that had a cross reference to a particular external database.

Makes sense based on the dbxref in the file name.

> A binary trait is presence/absence.

This makes sense too.

> A count trait is used otherwise, when the trait reflects a certain degree of genome investment in the trait in question.

Would you mind elaborating on this a bit more? I'm not following. Is this the number of HMM hits within a genome that meet a certain threshold?

> There is no reference to count_by_substrate in the paper; it is a helper function to summarize traits, primarily for substrate acquisition.

count_by_substrate is a value in the microtrait_rule-type column of Table S5 and the microtrait_trait-type column of Table S7. It's also mentioned in the supplementary description of the main text: "Mapping of microTrait rules to the microTrait hierarchy. microTrait traits are either of type binary or count. Count traits can be counted by themselves or by their substrate (microtrait_rule-type = “count_by_substrate”) in case of transporters. Refer to ST6 for the mapping between substrates and the microTrait hierarchy."

> Traits are hierarchical, rules are not.

Great, this is what I thought but I just wanted to be sure.

> The hierarchy has three levels, with traits summarized at each level.

So based on Table S7, there are 326 traits in total? Or are these only the traits that have a defined hierarchy?

Is an HMM equivalent to a trait in this context?

> Hope that helps.

ukaraoz commented 5 months ago

Hi, here are some more clarifications:

>> The microtrait-hmm database is separate from microtrait, and not all the HMMs in it contribute to the final set of rules and traits in production. Similarly, not all the rules contribute to the final trait hierarchy; some are simply not part of the production hierarchy.

> This is good to know. Are there any categorizations within the HMM sets in microtrait-hmm? For example, HMM set [x, y, and z] is associated with function A?

No, you go through the rules, as the HMM-to-trait relationship is not one-to-one.

>> A count trait is used otherwise, when the trait reflects a certain degree of genome investment in the trait in question.

> Would you mind elaborating on this a bit more? I'm not following. Is this the number of HMM hits within a genome that meet a certain threshold?

I would refer you to the paper, especially the portions and supplementary tables about how the information from TCDB is used. More or less yes, but not at the HMM level. For instance, for transporters of a particular substrate class, if the transporter is a protein complex with three individual proteins (each detected by an HMM), then that complex is counted as 1. Often there are multiple such complexes in the genome, and each is counted as one.
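To make the complex-level counting concrete, here is a minimal Python sketch. It is not microTrait's implementation: the HMM names, the `count_complexes` helper, and the choice of taking the minimum hit count across subunits are all assumptions made for illustration, and the actual rules may rely on other evidence.

```python
from collections import Counter

def count_complexes(hmm_hits, subunit_hmms):
    """Count complete complexes, limited by the least-abundant subunit.

    hmm_hits: list of HMM names, one entry per protein hit in the genome.
    subunit_hmms: HMM names of the proteins that make up one complex.
    Assumption: a complex is "complete" when every subunit has a hit.
    """
    hits = Counter(hmm_hits)
    return min(hits[h] for h in subunit_hmms)

# Hypothetical three-protein transporter complex and genome hits.
complex_subunits = ["hmmA", "hmmB", "hmmC"]
genome_hits = ["hmmA", "hmmB", "hmmC", "hmmA", "hmmB", "hmmC", "hmmA"]

n = count_complexes(genome_hits, complex_subunits)
print(n)  # 2: seven HMM hits, but only two complete complexes
```

The point of the sketch is only that the unit being counted is the complex, so seven individual HMM hits can yield a count of two.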

>> There is no reference to count_by_substrate in the paper; it is a helper function to summarize traits, primarily for substrate acquisition.

> count_by_substrate is a value in the microtrait_rule-type column of Table S5 and the microtrait_trait-type column of Table S7. It's also mentioned in the supplementary description of the main text: "Mapping of microTrait rules to the microTrait hierarchy. microTrait traits are either of type binary or count. Count traits can be counted by themselves or by their substrate (microtrait_rule-type = “count_by_substrate”) in case of transporters. Refer to ST6 for the mapping between substrates and the microTrait hierarchy."

Count by substrate is for counting the different protein families with evidence for targeting the same substrate. So you cannot count just the rules that map to the trait; you need to aggregate by substrate (which can be at a different level of granularity).
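As a rough illustration of aggregating by substrate rather than by rule, the following Python sketch counts the distinct protein families detected per substrate. The family names, substrates, mapping, and the `count_by_substrate` function written here are hypothetical; this is not the helper function in microTrait itself.

```python
from collections import defaultdict

# Hypothetical mapping from a detected protein family (rule) to its substrate.
rule_to_substrate = {
    "family1": "glucose",
    "family2": "glucose",
    "family3": "maltose",
    "family4": "phosphate",
}

def count_by_substrate(detected_rules, rule_to_substrate):
    """Return {substrate: number of distinct detected families targeting it}."""
    families = defaultdict(set)
    for rule in detected_rules:
        substrate = rule_to_substrate.get(rule)
        if substrate is not None:
            families[substrate].add(rule)  # a set ignores repeated detections
    return {s: len(f) for s, f in families.items()}

detected = ["family1", "family2", "family3", "family1"]
print(count_by_substrate(detected, rule_to_substrate))
# {'glucose': 2, 'maltose': 1}
```

Here glucose gets a count of 2 because two different families target it, even though `family1` was detected twice.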

>> Traits are hierarchical, rules are not.

> Great, this is what I thought but I just wanted to be sure.

>> The hierarchy has three levels, with traits summarized at each level.

> So based on Table S7, there are 326 traits in total? Or are these only the traits that have a defined hierarchy?

Yes, but that is the total across the three hierarchical levels. You would use one level only, not mix and match.
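The "one level only" advice can be sketched in Python as follows. The trait names and counts are made up, the `name1` to `name3` keys merely mirror the microtrait_trait-name1 to microtrait_trait-name3 columns, and summing counts is an assumed way of summarizing, not necessarily what microTrait does.

```python
from collections import defaultdict

# Hypothetical per-genome results, one row per finest-level trait.
rows = [
    {"name1": "acquisition", "name2": "acquisition:transport",
     "name3": "acquisition:transport:sugars", "count": 3},
    {"name1": "acquisition", "name2": "acquisition:transport",
     "name3": "acquisition:transport:ions", "count": 2},
    {"name1": "acquisition", "name2": "acquisition:degradation",
     "name3": "acquisition:degradation:polymers", "count": 1},
    {"name1": "stress", "name2": "stress:general",
     "name3": "stress:general:heat", "count": 4},
]

def summarize(rows, level):
    """Sum counts over rows that share a trait name at the given level (1-3)."""
    key = f"name{level}"
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["count"]
    return dict(totals)

# Pick a single level for the analysis.
print(summarize(rows, 2))
# {'acquisition:transport': 5, 'acquisition:degradation': 1, 'stress:general': 4}

# Adding trait names across all three levels (2 + 3 + 4) double-counts the
# same underlying evidence, which is why levels should not be mixed.
print(sum(len(summarize(rows, lvl)) for lvl in (1, 2, 3)))  # 9
```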

> Is an HMM equivalent to a trait in this context?

Some traits map to a single HMM, but not in general. So, no. Hope that helps.
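For readers trying to picture why HMMs and traits are not one-to-one, here is a small Python sketch of going from HMM hits through rules to binary traits. Everything in it is hypothetical: the rule names, the HMM names, the boolean form of the rules, and the rule-to-trait mapping are assumptions for illustration, not the contents of the supplementary tables.

```python
# Hypothetical rules: each is a predicate over the set of HMMs detected in a genome.
rules = {
    "ruleX": lambda h: "hmm1" in h,                 # a single-HMM rule
    "ruleY": lambda h: {"hmm2", "hmm3"} <= h,       # all listed HMMs required
    "ruleZ": lambda h: "hmm4" in h or "hmm5" in h,  # alternative HMMs
}

# Hypothetical rule -> trait mapping; several rules may feed the same trait.
rule_to_trait = {"ruleX": "traitA", "ruleY": "traitB", "ruleZ": "traitB"}

def binary_traits(detected_hmms, rules, rule_to_trait):
    """Mark a trait present if any rule mapped to it is satisfied."""
    detected = set(detected_hmms)
    traits = {trait: False for trait in rule_to_trait.values()}
    for name, rule in rules.items():
        if rule(detected):
            traits[rule_to_trait[name]] = True
    return traits

print(binary_traits(["hmm2", "hmm5"], rules, rule_to_trait))
# {'traitA': False, 'traitB': True}
```

In this toy setup `traitA` happens to depend on a single HMM, while `traitB` can be satisfied by several different HMM combinations, which matches the "some, but not in general" answer above.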