salzberg-lab / Balrog

Bacterial Annotation by Learned Representation of Genes
MIT License

What does genexa_genes.fasta contain? #3

Open igortru opened 4 years ago

igortru commented 4 years ago

It doesn't look like a complete protein FASTA. I ran blastp on a random sequence (">1119") from it and got no hits in nr.

I also ran blastp on the first 80 sequences and got only very weak hits.

"In this step, we run all predictions against non-hypothetical protein coding gene sequence from a set of 177 diverse bacterial genomes. All reference genomes in this step do not share a genus with any of the test set organisms. Predictions are also run against the SWISS-PROT curated protein sequence database."

P.S. I have successfully converted the Balrog Colab notebook into a standalone CPU version. It was a manual process, but absolutely straightforward. I plan to move it to a GPU environment.

P.P.S. Like you, I am now studying the original publication to learn how to develop the training part.

Markusjsommer commented 4 years ago

Hi Igor,

The genexa_genes.fasta contains a bunch of genes from a diverse set of bacteria. The sequences in the file are 3' to 5' (to be consistent with the scoring direction of the gene model), so you'll need to reverse them to get blastp hits.
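For anyone who wants to check the file against blastp, here is a minimal sketch of the reversal step in Python. It assumes the file holds amino acid sequences stored back to front, so a plain string reversal (not a reverse complement) restores the usual orientation. The function names are made up for illustration and are not part of Balrog.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Start a new record; sequence lines are collected in a list.
            records.append([line[1:], []])
        elif records:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


def reverse_records(records):
    """Reverse each sequence; headers are left untouched."""
    return [(header, seq[::-1]) for header, seq in records]


def write_fasta(records, width=60):
    """Format records as FASTA text with fixed-width sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy input, not taken from genexa_genes.fasta.
    example = ">1119\nKLM\nAV\n>2\nWY\n"
    print(write_fasta(reverse_records(read_fasta(example))), end="")
```

To use it on the real file, read the file contents into a string, pass them through the three functions in the same order, and write the result to a new FASTA before running blastp.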

I'm working on a C++ standalone with a few more features, so it will take me a bit more time, but if you have a good working python version from the notebook feel free to make a pull request. That would be very helpful!

I'm not currently planning on making an open source method to go from sequence to trained model, as that requires a lot more moving parts/infrastructure, but releasing new models trained on different sources (e.g. phage/virus) would not be too difficult if there is interest.

vdejager commented 3 years ago

Hi Markus,

is there any news on the method to train on different sources?

Markusjsommer commented 3 years ago

Hi Vic,

It's pretty messy, but here's an old ipynb I used to train the model on Google Colab.

The training data shingles are 100-mers of amino acid sequence in a numpy array where each amino acid is represented by an integer 1 to 20 (e.g. H = 1, Q = 2, ...). Positive examples for Balrog are from named genes, and negative examples are the same nucleotide sequence translated in the incorrect frame, though in theory they could be anything you want.
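As a rough illustration of that data layout, the sketch below encodes a protein string as integers and cuts it into 100-residue shingles. Only H = 1 and Q = 2 come from this thread; the rest of the ordering in `HYPOTHETICAL_ALPHABET` is a placeholder, and the `step` between shingles is also an assumption, so check both against the notebook before reusing it. Plain lists are used here to keep the example dependency-free.

```python
# Placeholder ordering: only H = 1 and Q = 2 are stated in the thread.
# The remaining 18 letters are in alphabetical order purely for illustration.
HYPOTHETICAL_ALPHABET = "HQACDEFGIKLMNPRSTVWY"
AA_TO_INT = {aa: i for i, aa in enumerate(HYPOTHETICAL_ALPHABET, start=1)}


def encode(protein):
    """Map each amino acid letter to its integer code (1 to 20)."""
    return [AA_TO_INT[aa] for aa in protein]


def shingles(encoded, size=100, step=100):
    """Cut an encoded sequence into full-length windows of `size` residues.

    `step` controls the offset between windows; whether the real training
    data used overlapping windows is not stated, so this is an assumption.
    Trailing residues that do not fill a whole window are dropped.
    """
    return [encoded[i:i + size]
            for i in range(0, len(encoded) - size + 1, step)]


if __name__ == "__main__":
    toy_protein = "HQ" * 125  # 250 residues of toy data
    windows = shingles(encode(toy_protein))
    print(len(windows), len(windows[0]))
```

Because every window has the same length, the list of shingles can be passed to `numpy.asarray` to get a 2D integer array of shape (number of shingles, 100).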

TCN_train_memmap_newtrain.zip