khanhlee / bert-enhancer

A Transformer Architecture Based on BERT and 2D Convolutional Neural Network to Identify DNA Enhancers from Sequence Information

the general framework of the model #8

Open Carrotljw opened 3 years ago

Carrotljw commented 3 years ago

Can I ask you some questions about the code framework? I'd like to know if there is anything wrong with my understanding of the code.

Step 1: You use extract_seq.py to process the DNA sequences into .seq files, each containing a DNA sequence of length 200.

Step 2: You use the command in bert2json.txt to extract features from the DNA sequences. This step does not retrain the BERT model; it directly uses the pre-trained BERT model published by Google to extract features and obtain a JSON file for each sequence.

Step 3: You convert the JSON files obtained in the previous step to CSV files (a rough sketch of how I picture this conversion is below).

Step 4: You use the data from the CSV files as input to a CNN and train a 2D CNN model.

Above is my understanding of your code; if there is a mistake, please point it out. That will help me a lot! Thank you very much for your help.
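For reference, this is a rough sketch of how I imagine Step 3 (JSONL to CSV), assuming the JSONL layout produced by Google's extract_features.py with --layers=-1 (one layer of values per token). The file names are placeholders, not the repository's actual conversion script:

```python
# Sketch of Step 3: convert one BERT JSONL feature file into a CSV matrix.
# Assumes the extract_features.py output format: one JSON object per line,
# with a "features" list holding per-token "layers" of "values".
import json
import csv

def jsonl_to_csv(jsonl_path, csv_path):
    with open(jsonl_path) as f_in, open(csv_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for line in f_in:
            record = json.loads(line)
            for feature in record["features"]:
                # one row per token: 768 values for BERT-Base, 1024 for BERT-Large
                writer.writerow(feature["layers"][0]["values"])

jsonl_to_csv("Input.json", "Input.csv")  # placeholder file names
```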

khanhlee commented 3 years ago

Hi @Carrotljw, you are right about the processing steps of the code. We only used the pre-trained BERT models from Google; we did not train the model ourselves.

Carrotljw commented 3 years ago

Thank you for your reply

Iris7788 commented 3 years ago

Hello, I have a question about the seq files generated in the first step and the Input.txt used in the second step. What is the relationship between these two files? Or how should I generate Input.txt?

Titantime314 commented 3 years ago

Hello @khanhlee, I have the same confusion as Iris7788. What is the purpose of the first step? Should I use the .seq files as Input.txt, or use the full sequence file (like "enhancer.cv.unigram.txt") as the input in Step 2? I hope you can answer, thanks!

khanhlee commented 3 years ago

Hi @Iris7788 @Titantime314, the first step is to split the full file into individual sequence files, and we then use these .seq files in the next step. (Because we use the BERT model to generate features from the .seq files one by one, this step is necessary.) Input.txt is an example of an output .seq file; in practice, you will replace it with your own file.
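For readers following along, a minimal sketch of what this splitting step could look like, assuming the full file holds one 200 bp sequence per line (extract_seq.py in the repository may differ in its details):

```python
# Sketch of Step 1: split a multi-sequence file into individual .seq files,
# so BERT can featurize them one by one. File names are illustrative.
import os

def split_sequences(full_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(full_file) as f:
        for i, line in enumerate(f):
            seq = line.strip()
            if not seq:
                continue
            # one sequence per .seq file
            with open(os.path.join(out_dir, f"seq_{i}.seq"), "w") as out:
                out.write(seq)

split_sequences("enhancer.cv.txt", "enhancer_seqs")  # assumed input layout
```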

Titantime314 commented 3 years ago

Thank you for your reply! So does that mean we need to use the thousands of .csv files obtained in Steps 2 & 3 to train the CNN model? I also have a question about the size of each training sample (200x768): 200 is the length of the DNA sequence, but where does 768 come from? I can only generate .csv files of size 200x1024 with Steps 2 & 3. Thanks a lot for answering my questions!

khanhlee commented 3 years ago

Hi @Titantime314, it depends on the model you used; the output vectors have different dimensions. Since I used the BERT-Base, Multilingual Cased model, the output has 768 dimensions. If you used BERT-Large, the dimension would be 1024. Best regards!
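To make the shapes concrete, here is one illustrative way to stack the per-sequence CSVs into a single CNN input tensor. The directory name and helper are made up for the example, and 768 would become 1024 with BERT-Large:

```python
# Sketch: assemble per-sequence 200 x 768 CSVs into a (N, 200, 768, 1) tensor
# suitable as 2D CNN input. Paths and shapes are illustrative, not from 2D_CNN.py.
import glob
import numpy as np

def load_feature_dir(csv_dir, seq_len=200, dim=768):
    matrices = []
    for path in sorted(glob.glob(f"{csv_dir}/*.csv")):
        m = np.loadtxt(path, delimiter=",")
        if m.shape == (seq_len, dim):   # skip malformed files
            matrices.append(m)
    # add a trailing channel axis for the 2D convolution
    return np.stack(matrices)[..., np.newaxis]

X_enh = load_feature_dir("enhancer_csv")  # hypothetical folder of enhancer CSVs
```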

Titantime314 commented 3 years ago

Thank you very much for the detailed reply!

smtnkc commented 3 years ago

Dear @khanhlee

I'm sorry but there are some incompatibilities and uncertainties in both the source code and README instructions.

Thus, I want to be sure that I understand the process correctly.

In Step 1, we generate thousands of .seq files, each consisting of a 200 bp sequence.

In Step 2, we run the command in bert2json.txt for each of these .seq files, so we obtain thousands of feature files in JSONL format.

In Step 3, we convert all these JSONL files to CSV files.

In Step 4, we run 2D_CNN.py using these CSV files. (Here, you are reading the enhancer_test and non_cv files. I guess these files consist of the enhancer and non-enhancer features generated from enhancer.cv.txt and non.cv.txt, respectively. And the train/test split placed just after the file reading is actually a train/validation split. Right?)
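For concreteness, this is roughly how I read Step 4; the arrays and shapes below are placeholders standing in for the stacked feature tensors, not the actual contents of 2D_CNN.py:

```python
# Sketch of my reading of Step 4: stack enhancer / non-enhancer feature tensors,
# attach labels, and split into train / validation sets.
import numpy as np
from sklearn.model_selection import train_test_split

# placeholders for the stacked (N, 200, 768, 1) feature tensors
X_enh = np.random.rand(10, 200, 768, 1)
X_non = np.random.rand(10, 200, 768, 1)

X = np.concatenate([X_enh, X_non])
y = np.concatenate([np.ones(len(X_enh)), np.zeros(len(X_non))])

# what I am calling the train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_val.shape)
```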

Thanks in advance for your reply :)

NEMX2022 commented 3 years ago

I have the same problem as smtnkc. Another question: if only the CSV file generated from Input.txt is fed into the CNN, that would not work, right? The amount of data is too small? Please answer, thank you! @smtnkc @khanhlee