rnajena / bertax_training

Training scripts for BERTax

Question about training a BERTax model for phylum to species taxonomy classification #10

Open Steven-GUHK opened 1 year ago

Steven-GUHK commented 1 year ago

Hi! I have read your paper about BERTax. It is wonderful and very inspiring. I'm interested in training a BERTax model for my own application: predicting the phylum, class, order, family, genus, and species of a DNA sequence. Since I need to predict six labels, I plan to add three more taxonomy output layers after the original BERTax taxonomy layers. Also, I need to use different training and testing datasets. Currently, my dataset looks like this:

species_1.fasta:

>sequence_1
ATCG...
>sequence_2
ATCG...
...

species_2.fasta:

>sequence_1
ATCG...
>sequence_2
ATCG...
...

species_n.fasta:

>sequence_1
ATCG...
>sequence_2
ATCG...
...

Each fasta file corresponds to one species with a taxonomy label (from phylum to species), and a file may contain more than one sequence of that species.

I have read your instructions on how to prepare the data for training. I think I should convert my data into this format:

[screenshot of the planned data format]

Thank you very much in advance for any suggestions about my task!

f-kretschmer commented 1 year ago

Hi!

Since we used the gene-model structure only at the beginning of our development, more work would be required to adapt its training process to the new task. But for the genomic model, probably not a lot has to be changed. Your data would need to be in the "fragment"-type structure, which can be generated from multi-fastas (https://github.com/f-kretschmer/bertax_training#multi-fastas-with-taxids). You would have to concatenate your data into a single file and adapt the header of each sequence in the following way:

>species_1_taxid 0 
ATCG....
>species_1_taxid 1
ATCG....

....

>species_1_taxid m
ATCG....
>species_2_taxid m + 1
ATCG....

...

>species_n_taxid x
ATCG....

The first value of the header is simply the NCBI-TaxID (https://www.ncbi.nlm.nih.gov/taxonomy), from which the classes/ranks for each species can be retrieved. The second is a running index, so that each sequence in the multi fasta has a different header. This file can then be converted with https://github.com/f-kretschmer/bertax_training/blob/master/preprocessing/fasta2fragments.py to the "fragments"-format. For training (fine-tuning), the CLI-argument --ranks can be used to specify which ranks to train (and thus which output-layers to use). Don't hesitate to ask if you have further questions!
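
As a side illustration of the header rewriting described above (not part of the repository), here is a minimal Python sketch that concatenates per-species fastas into one multi-fasta with ">TaxID running-index" headers; the file names and the species-to-TaxID mapping are hypothetical placeholders:

# Minimal sketch (not repository code): build one multi-fasta whose headers
# follow the ">TAXID RUNNING_INDEX" convention described above.
# File names and TaxIDs below are hypothetical placeholders.
from pathlib import Path

species_to_taxid = {
    "species_1.fasta": 9606,    # hypothetical TaxID for species 1
    "species_2.fasta": 10090,   # hypothetical TaxID for species 2
}

index = 0
with open("all_species.fasta", "w") as out:
    for fasta_file, taxid in species_to_taxid.items():
        for line in Path(fasta_file).read_text().splitlines():
            if line.startswith(">"):
                # replace the original header with "TaxID running-index"
                out.write(f">{taxid} {index}\n")
                index += 1
            elif line.strip():
                out.write(line.strip() + "\n")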

Steven-GUHK commented 1 year ago


Thanks for your suggestion! Here is what I have done:

  1. I concatenated all the fasta files into one fasta file with TaxID-and-index headers, as you suggested. Then I used fasta2fragments.py to generate two files: train_fragments.json and train_species_picked.txt.

  2. Because I don't have the ['Viruses', 'Archaea', 'Bacteria', 'Eukaryota'] classes, I just changed them to ['train']:

    [screenshot of the modified class list]
  3. I ran the command from the GitHub README: python -m models.bert_nc fragments_root_dir --batch_size 32 --head_num 5 --transformer_num 12 --embed_dim 250 --feed_forward_dim 1024 --dropout_rate 0.05 --name bert_nc_C2 --epochs 10, and successfully trained a model.

  4. Then I fine-tuned the model. Because I need to predict phylum to species, I modified bert_nc_finetune.py as follows:

    [screenshot of the modified bert_nc_finetune.py]

    Then I ran the command: python -m models.bert_nc_finetune bert_nc_C2.h5 fragments_root_dir --multi_tax --epochs 15 --batch_size 24 --save_name small_trainingset_filtered_fix_classes_selection --store_predictions --nr_seqs 1000000000
    However, there is a problem:

    [screenshot of the error: "Unable to allocate 2.99 TiB for an array ..." with data type <U...]

    It seems that loading all the data at once takes too much memory. I deleted the np.array() operation and the error disappeared (see the short illustration after this comment for why this array gets so large).

[screenshot of the modified code with np.array() removed]

However, I encountered another problem: fine-tune-error.log

Do you have any suggestions about my steps above? Thank you very much!
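
A side note on the memory error above: the "Unable to allocate ... TiB ... data type <U..." message is typical of calling np.array on a list of raw sequence strings, because NumPy pads every entry to the length of the longest string at 4 bytes per character. A small, self-contained illustration, independent of the BERTax code:

import numpy as np

# Two sequences of very different lengths; np.array pads both to the longest
# length and stores them as fixed-width unicode (dtype <U...), 4 bytes/char.
seqs = ["ACGT" * 10, "ACGT" * 250_000]
arr = np.array(seqs)
print(arr.dtype)                  # e.g. <U1000000
print(arr.nbytes / 1e6, "MB")     # ~8 MB for just two sequences

# Keeping the sequences in a plain Python list (or converting them to token
# IDs before stacking) avoids this padding overhead.
seqs_as_list = list(seqs)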

f-kretschmer commented 1 year ago

I haven't seen this error before. Could you first check whether it also comes up if you change back the np.array lines, perhaps by first using a smaller training dataset (so that it fits into memory)? Using np.asarray instead of np.array could also reduce memory usage. The data type (and also the size!) in your screenshot (Unable to allocate 2.99 TiB for an array ... with data type <U...) is quite strange: the data should all be numerical. Perhaps check that the three variables x, y, and y_species all have the expected contents (before or at line 76 in bert_nc_finetune.py).
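
A tiny helper of the kind suggested here, purely illustrative and not part of the repository, that prints the length, element type, and a short preview of each variable without materializing a large NumPy array:

# Illustrative only: inspect x, y, y_species without converting them to arrays.
def describe(name, values):
    print(f"{name}: n={len(values)}, "
          f"element type={type(values[0]).__name__}, "
          f"preview={str(values[0])[:60]!r}")

# Example with dummy data; in bert_nc_finetune.py one would instead call
# describe("x", x), describe("y", y), describe("y_species", y_species).
describe("x", ["ACGTACGT", "TTGCA"])
describe("y", [["Eukaryota", "Chordata"], ["Bacteria", "Proteobacteria"]])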

Steven-GUHK commented 1 year ago

I printed the number, type, and value of x, y, and y_species before the shuffle:

[screenshot of the added print statements]

The result is

[screenshot of the printed output]

and after the shuffle, the data types are:

[screenshot of the data types after shuffle]

Steven-GUHK commented 1 year ago

Following up on the previous problem, I find it doesn't matter whether I use np.array(x) or not in the function load_fragments(), because I use preprocessing.make_dataset.py to generate the train.tsv and test.tsv files, and the generated files are the same without np.array(x). Here is a screenshot:

[screenshot comparing the generated train.tsv/test.tsv files]

Then I used the train.tsv and test.tsv to train the model with the argument --use_defined_train_test_set. However, the problem still exists: fine-tune-error.log

I have uploaded the pre-trained model and the two files here: https://drive.google.com/drive/folders/1TUSTrjlGbtYqVBcUmybAVXxLEcvG8duT?usp=sharing
Because train.tsv is too large, you can use test.tsv twice.

I would very much appreciate it if you could help me out 🙏

Steven-GUHK commented 1 year ago

Sorry to bother you again, but here is a strange thing: to test whether the problem is on my side, I used the model provided in resources/bert_nc_C2_final.h5 for fine-tuning, with a small dataset so that there is no ArrayMemoryError. The only thing I changed is models/model.py, where I changed the file name:

[screenshot of the changed file name in models/model.py]

Here is my command to run bert_nc_finetune.py: nohup python -m models.bert_nc_finetune resources/bert_nc_C2_final.h5 fragments_root_dir --multi_tax --epochs 15 --batch_size 24 --save_name small --store_predictions --nr_seqs 1000000000 > fine-tune.log 2> fine-tune-error.log

And these are the outputs: fine-tune-error.log and fine-tune.log. The problem still exists.

I have uploaded the train_small.fasta, train_small_fragments.json, and train_small_species_picked.txt here: https://drive.google.com/drive/folders/1TUSTrjlGbtYqVBcUmybAVXxLEcvG8duT?usp=sharing

Could you please help me check why? Is it because I changed the list of names? Thank you very much!

f-kretschmer commented 1 year ago

Just a heads-up and an apology that I haven't been able to look into it in detail yet. I can't immediately see anything wrong with your data or commands; it might be that the error is related to TensorFlow internals and is occurring because of package version conflicts (keras-bert, which BERTax depends on, does not work with all versions of tensorflow or keras). I'll write back when I find something.

Steven-GUHK commented 1 year ago

Update: after several days of trying, I updated my TensorFlow version to 2.12.0 and the model can now be trained normally. But I have to reduce batch_size to 1, otherwise there is a memory error. I wonder about the impact of batch_size on the final accuracy.

One more question: I found that the sample_weight in the pre-training script bert_nc.py only covers the superkingdom. Since I want to train six ranks, do I need to make changes in bert_nc.py, or do I only need to change the code in bert_nc_finetune.py?

Also, there is a problem with testing after training:

[screenshots of the testing error]

It seems that the model has two inputs but only received one. If the training process has no problem, how can the test fail, given that the train and test data have the same format?

f-kretschmer commented 1 year ago

Good to hear that changing the tensorflow version solved the first issue!

Steven-GUHK commented 1 year ago

Thanks for your information. It is true that the generator returns a list containing tokens and segments, and the segments are all 0s. However, I don't know why the predict() function doesn't unpack the list, so I did it manually. I finally got the results, but they are awful: the phylum accuracy is around 0.02.
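
For reference, a minimal sketch of what the manual unpacking described above could look like; the shapes and the model variable are placeholders, since keras-bert models take a token-ID array and a segment array as two separate inputs (with all-zero segments for single-sequence input):

import numpy as np

# Placeholder batch: token IDs as produced by the generator, plus an all-zero
# segment array of the same shape, passed as two separate inputs.
batch_size, seq_len = 24, 512                 # illustrative values only
tokens = np.zeros((batch_size, seq_len), dtype=np.int32)
segments = np.zeros_like(tokens)
# predictions = model.predict([tokens, segments])  # 'model' is the fine-tuned keras model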

I used DNA from 1075 species to pre-train and fine-tune the model. For each species, I chose 10 sequences that do not appear in training for testing, so there are 10750 test sequences. Here are three logs: pre-training.log.zip fine-tune.log.zip test.log.zip

I find that the final losses are larger than the initial ones. Do you think I should pre-train the model myself, or should I just fine-tune your pre-trained model? Thanks!