Open theosanderson opened 1 year ago
@lilymaryam, @aofarrel, @russcd and I are using Taxonium to view UShER trees of M. tuberculosis genomes, and it works fine unless we use the usher_to_taxonium --genbank option to see what the protein-coding mutations are. In that case the first line of the .jsonl.gz becomes so large (650MB to 900MB+, depending on the size of the tree and the filtering options) that it apparently exceeds V8's string length limit of 512MB, and Node crashes with the error RangeError: Invalid string length (https://github.com/nodejs/node/issues/35973).
A more compact JSON representation of mutations might help, and/or splitting some of the first-line values across multiple lines? I can provide example files if that would help.
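For anyone who wants to check whether a given file is likely to hit this, here is a minimal Python sketch that measures the first line of a gzipped JSONL file without loading the whole line into memory. The function names and the `ASSUMED_LIMIT` constant are hypothetical and not part of Taxonium; the 512MB figure is simply the one quoted above, and the real V8 limit is counted in characters and may differ between Node versions, so treat the result as a rough indicator only.

```python
import gzip

# Assumption: the 512MB figure quoted in this thread. The exact V8 limit is
# measured in characters and can vary by Node version.
ASSUMED_LIMIT = 512 * 1024 * 1024


def first_line_size(path, chunk_size=1 << 20):
    """Return the byte length of the first line of a gzipped file (newline excluded)."""
    total = 0
    with gzip.open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a newline.
                return total
            newline_at = chunk.find(b"\n")
            if newline_at != -1:
                return total + newline_at
            total += len(chunk)


def exceeds_limit(path, limit=ASSUMED_LIMIT):
    """Return True if the first line is longer than the assumed limit."""
    return first_line_size(path) > limit
```

Calling `exceeds_limit("tree.jsonl.gz")` (the file name is a placeholder) would then indicate whether the first line is larger than the assumed limit. Bytes and characters are close to equal for mostly-ASCII JSON, which is why counting bytes is a reasonable approximation here.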
Thanks a lot for the report, and it's exciting that you are doing this!
Have you tried adding the --only_variable_sites parameter? I think the issue could be about the encoding of the ref genome. I definitely need a better solution to that generally, and intend to build one, but it could be a kind of workaround for now.
> Have you tried adding the --only_variable_sites parameter?
Ah, I didn't know about that one! And it does fix it! Thanks, and I'll use that for large genomes going forward (and make sure to look at the --help again next time I have a problem 🙂).
Fantastic, and no prob, and it still definitely needs a real solution!
I know that some people have used or are using Taxonium for bacterial genomes. I suspect that this poses some issues, e.g. the number of mutations on each branch might get overwhelming. If any of you folks would like to chat about ways to make your experience better, do let me know!