rnajena / bertax_training

Training scripts for BERTax

Classifying eukaryotic transcripts #13

Closed by alephreish 5 months ago

alephreish commented 5 months ago

Hi, I'm interested in classifying transcripts from (mostly unicellular) eukaryotes and thought that BERTax might be the right approach if trained on the relevant data.

The training set includes ~950 representative taxa from EukProt, covering the whole range of eukaryotic diversity. EukProt itself only contains protein sequences, but I have the original assemblies collected from the various sources.

Three questions:

Thanks!

f-kretschmer commented 5 months ago

Hi, that sounds very interesting! There are two pre-trained models available here: bert_nc_C2_final.h5, which is used (after fine-tuning) in the "final" BERTax model and application, and bert_gene_D_final.h5, which was used in an early stage of development. The "gene" model was trained solely on gene sequences, in contrast to the "genomic" model, which uses fragments of whole genomes. However, all of the models were pre-trained on all phylogenetic "superkingdoms", not just Eukaryotes, so redoing the pre-training might be a good idea. To use a custom taxonomy, the code would have to be adapted a little; it may even be enough to modify the get_dicts function here to generate the dictionary from a different source. Unfortunately, I can't really give an estimate of the computational cost/time, since it depends very much on the GPU being used, but I don't think it should be too much for a dataset of this size!
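
For illustration, a custom replacement could look roughly like this (untested sketch; the actual get_dicts signature and return values in the repo may differ, and the two-column TSV input is just an assumed format):

```python
import csv

# Untested sketch of a drop-in for get_dicts: build the label dictionaries
# from a custom two-column TSV (sequence_id <TAB> class_label) instead of
# the built-in four-superkingdom taxonomy.
def get_dicts(mapping_tsv):
    seq2class = {}
    with open(mapping_tsv) as fh:
        for seq_id, label in csv.reader(fh, delimiter="\t"):
            seq2class[seq_id] = label
    classes = sorted(set(seq2class.values()))
    class2index = {c: i for i, c in enumerate(classes)}
    return seq2class, class2index
```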

alephreish commented 5 months ago

Thanks for the directions! I'll give it a try and let you know how it goes.

alephreish commented 5 months ago

I'm a bit confused by this passage in the README: "each sequence is [to be] contained in a fasta file". So each gene sequence should be in a separate file? In my case that would amount to 27,050,878 files (on average ~29K genes per species, ~950 nt per gene), which is way more than is allowed, e.g., per directory.

Another question: what is the significance of the 'class' structure? From pretraining_dataset.zip I gather that the classes are the three domains of life plus viruses.

f-kretschmer commented 5 months ago

We only used the gene model and this directory structure at the beginning of BERTax's development; we did not encounter any problems with it then, albeit with a much lower number of files, if I remember correctly (and on Linux; I am not aware of a hard limit on the number of files there). However, the directory structure is quite flexible even without changing any code: the paths of all sequence/gene fastas are taken from the files.json file, as detailed in the README, which allows an arbitrary directory structure as long as each sequence is contained in a separate file. This is of course not very efficient for millions of files, so I would suggest either adapting the data-loading code or, preferably, moving to the "genomic" model structure, where JSONs and/or multifastas are used.

The four classes of the directory structure are mainly there because we started BERTax development to differentiate between these four "superkingdoms"; this can of course be freely adapted. In general, the directory structure has no effect on the model architecture and is mainly a result of how we organized our data locally. For a different directory structure, I don't think much modification of the code is needed.
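
As a rough illustration, a files.json for a nested per-species layout could be generated with something like the following (sketch only; the schema assumed here, class name mapped to a list of fasta paths, is illustrative, so please check the README for the exact format the training code expects):

```python
import json
from pathlib import Path

# Sketch: collect all per-gene fastas under root/<class>/<species>/*.fa
# into a files.json-style mapping {class_name: [fasta paths]}.
# The exact schema expected by the training code is described in the README.
def build_files_json(root, out="files.json"):
    entries = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            entries[class_dir.name] = [str(p) for p in class_dir.rglob("*.fa")]
    with open(out, "w") as fh:
        json.dump(entries, fh, indent=2)
```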

alephreish commented 5 months ago

@f-kretschmer thanks for your response! (and BTW thanks in general for the nice project!)

By default, even ext4 would not allow that many files per directory, but sure, the natural thing to do is to put the split fastas in per-species sub-directories. I'll take a look at the code; it should not be much effort to switch to multifasta.

alephreish commented 5 months ago

I've adapted models.bert_finetune for multifasta input.
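
Roughly, the loading side now looks like this (simplified; identifiers are illustrative, and I'm glossing over tokenization and fragmenting):

```python
from Bio import SeqIO

# Simplified version of the adaptation: yield (sequence, label) pairs
# from one multifasta per class instead of one fasta file per gene.
def iter_multifasta(fasta_path, label):
    for record in SeqIO.parse(fasta_path, "fasta"):
        yield str(record.seq), label
```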

For now, I've decided not to rerun pre-training and am instead playing with fine-tuning bert_gene_D_final.h5. I've discovered that neither TaxDB nor TaxidLineage is actually used in the training here; instead, the classes assigned in files.json are used directly as taxonomic labels, right? This makes things quite flexible, but it also means that only a flat label structure is possible. So I guess I'll train a set of models for different ranks (I think the two or three top ranks would suffice for my aims; see the sketch below), or would you recommend something else?
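
Concretely, I mean something like this (illustrative sketch; the lineage tuples stand in for my EukProt-derived lineage table):

```python
# Sketch: derive flat labels for one chosen rank from lineage tuples,
# in order to train one independent model per rank.
lineages = {
    "gene_0001": ("Eukaryota", "Discoba", "Euglenozoa"),
    "gene_0002": ("Eukaryota", "TSAR", "Ciliophora"),
}

def labels_for_rank(lineages, rank_index):
    return {seq_id: lineage[rank_index] for seq_id, lineage in lineages.items()}

# rank_index 1 would give the "supergroup"-level labels for files.json
print(labels_for_rank(lineages, 1))
```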

I'm also curious as to what exactly was used for pre-training bert_gene_D_final.h5: whole gene sequences with UTRs and introns, or only the CDS?

f-kretschmer commented 5 months ago

instead the classes assigned in files.json are directly used as taxonomic labels, right?

I think you are right; with these gene models we only did classification by superkingdom, so a single output layer was used. It probably needs some more adaptation, but we found (for the genomic model, that is) that adding multiple output layers (phylum, genus) can also increase performance on the more general ranks (superkingdom); see the supplement of the PNAS paper. So going this route, instead of using only one output layer or multiple separate models, might be worth a try.
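
In Keras terms, the idea is roughly the following (simplified sketch, not our actual code; the layer sizes and class counts are placeholders):

```python
from tensorflow.keras import layers, Model

# Sketch: one shared encoder representation, one softmax head per rank.
# encoder_inputs / encoder_output would come from the pre-trained BERT model.
def build_multi_rank_model(encoder_inputs, encoder_output,
                           n_superkingdoms=4, n_phyla=50, n_genera=500):
    shared = layers.Dense(256, activation="relu")(encoder_output)
    outputs = [
        layers.Dense(n_superkingdoms, activation="softmax",
                     name="superkingdom")(shared),
        layers.Dense(n_phyla, activation="softmax", name="phylum")(shared),
        layers.Dense(n_genera, activation="softmax", name="genus")(shared),
    ]
    model = Model(encoder_inputs, outputs)
    # one loss per output head; training then optimizes all ranks jointly
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```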

what exactly was used for pre-training bert_gene_D_final.h5

For the "gene" models, we used the coding sequences corresponding to UniRef50 proteins, with superkingdom as the classification target.