Extracting mutation data from the jsonl tree

theosanderson / taxonium

A tool for exploring very large trees in the browser

http://taxonium.org

GNU General Public License v3.0


Extracting mutation data from the jsonl tree #581

Closed Biophylo2001 closed 4 months ago

Biophylo2001 commented 5 months ago

Hi, I've generated my SARS-CoV-2 tree (jsonl format) using the UShER command-line tools. Now I want to extract mutation frequencies and identify the predominant amino acid mutations in my tree. How can I do this efficiently, so that I can use that information in Taxonium's Search > Mutation section to study spread? Manually entering every mutation is time-consuming.

Should I do that from my combined VCF files, or can it be done using the JSONL file itself? How was this done when creating Cov2Tree? Thank you. Here's what my merged VCF file looks like:

snpeffdata.vcf.gz

Thanks a lot

theosanderson commented 5 months ago

The first line of the jsonl contains various bits of data for the whole tree, including a list of mutations. Each line below it describes one node and has a list of numbers, where each number is the index of a mutation in that initial list. So you could analyse the file using that. You can also analyse the UShER protobuf with e.g. BTE (Big Tree Explorer).
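A minimal Python sketch of the approach described above, for readers who want to try it. It reads the header line, then tallies how many node lines reference each mutation index. The key name `mutations` and the per-mutation fields `gene`, `previous_residue`, `residue_pos` and `new_residue` are assumptions about the file layout and are not confirmed in this thread, so check them against your own header line first. The helper names (`count_mutation_events`, `label`, `read_jsonl`) are made up for illustration.

```python
import gzip
import json
from collections import Counter


def read_jsonl(path):
    """Yield the lines of a .jsonl or .jsonl.gz file as text."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle


def count_mutation_events(lines):
    """Return (mutation, node_count) pairs, most frequent first.

    `lines` is an iterable of JSON strings: the header first, then one
    node per line.
    """
    lines = iter(lines)
    header = json.loads(next(lines))
    # Assumption: the header stores the mutation list under "mutations".
    mutation_table = header["mutations"]

    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        node = json.loads(line)
        # Each number is an index into the header's mutation list.
        for idx in node.get("mutations", []):
            counts[idx] += 1

    return [(mutation_table[idx], n) for idx, n in counts.most_common()]


def label(mutation):
    """Format a mutation as e.g. S:D614G (field names are assumptions)."""
    return "{}:{}{}{}".format(
        mutation.get("gene"),
        mutation.get("previous_residue"),
        mutation.get("residue_pos"),
        mutation.get("new_residue"),
    )


# Example usage (the file name is the attachment mentioned in this thread):
# for mutation, n in count_mutation_events(read_jsonl("treefile.jsonl.gz"))[:20]:
#     print(label(mutation), n)
```

Note that this counts the nodes (branches) on which a mutation is recorded, not the number of samples that carry it. Getting per-sample frequencies would require walking the tree and propagating mutations from each node to its descendants.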

Biophylo2001 commented 5 months ago

@theosanderson Thank you for your reply. I have tried using the jsonl file to extract the mutation data; however, the Treenome browser doesn't show the correct number of results. Maybe I am doing something wrong.

For example, for the S 1082 mutation in the screenshot, it shows only "1 result", circled at the root node. What does that mean? I am sure there are plenty of mutations at residue S 1082, but they are not shown accurately.

Screenshot (375)

theosanderson commented 5 months ago

Sorry that I missed this. Without seeing your tree file it's hard to assess what's going on here.

Biophylo2001 commented 5 months ago

@theosanderson I have attached my tree file here. Thank you for looking into it. treefile.jsonl.gz

theosanderson commented 5 months ago

S:1082 appears to be C throughout your tree