igemmcmaster / genome-transformer

Pretrained efficient transformers on genomes -- WIP

Incorporating Phylogeny #10

Closed frankcsquared closed 3 years ago

frankcsquared commented 3 years ago

Tasks:

matthewcso commented 3 years ago

Phylogeny is already recorded in the GBFF files, which BioPython can read (these files also contain a variety of other metadata). No further work is needed to organize it at the level of single files; the notebook linked below shows an example of how to read them.

https://drive.google.com/file/d/1Q_OcI3n-sDB_8RhjUuoSeZOtE5vIYW41/view?usp=sharing
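For anyone who can't open the notebook, here is a rough sketch of the idea (not the notebook's actual code). It assumes BioPython is installed and that each record carries the usual `organism` and `taxonomy` annotations; the function name `read_lineages` is made up for illustration.

```python
from Bio import SeqIO


def read_lineages(gbff_path):
    """Yield (record id, organism, taxonomy list) for each record in a GBFF file.

    `read_lineages` is a hypothetical helper name, not part of BioPython.
    """
    # GBFF is the GenBank flat file format, which SeqIO calls "genbank".
    for record in SeqIO.parse(gbff_path, "genbank"):
        annotations = record.annotations
        yield (
            record.id,
            # Fall back to empty values if a record lacks these annotations.
            annotations.get("organism", ""),
            annotations.get("taxonomy", []),
        )
```

The taxonomy comes back as a list of lineage names ordered from broadest to most specific, as written in the file's ORGANISM block.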

matthewcso commented 3 years ago

I'm closing the issue, but noting that it might be worth reading through all the files and storing the phylogeny data in a single dataframe, in case you want to easily subset a certain group of bacteria.
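A minimal sketch of what that dataframe could look like, assuming pandas and an input of (record ID, organism, taxonomy list) tuples, for instance taken from each BioPython record's `annotations`. The names `lineage_table` and `subset_by_taxon` and the column names are hypothetical, not something in the repo.

```python
import pandas as pd


def lineage_table(entries):
    """Build one row per record from (record_id, organism, taxonomy) tuples."""
    return pd.DataFrame(
        list(entries),
        columns=["record_id", "organism", "taxonomy"],
    )


def subset_by_taxon(df, taxon):
    """Keep the rows whose lineage contains the given taxon name."""
    # GenBank lineages are unranked lists of names, so we match by
    # membership rather than assuming a fixed position per rank.
    mask = df["taxonomy"].apply(lambda lineage: taxon in lineage)
    return df[mask]
```

Something like `subset_by_taxon(df, "Gammaproteobacteria")` would then pull out one group. Matching by name instead of by position is a deliberate choice here, since the number of levels in a lineage varies between organisms.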