Carrotljw opened 3 years ago
Hi @Carrotljw, you are right about the code's workflow. We only used the pre-trained BERT models from Google; we did not train the model ourselves.
Thank you for your reply
Hello, I have a question about the .seq files generated in the first step and the Input.txt used in the second step. What is the relationship between these two files? And how should I generate Input.txt?
Hello @khanhlee, I have the same confusion as Iris7788. What is the meaning of the first step? Should I use a .seq file as Input.txt, or the full sequence file (like "enhancer.cv.unigram.txt"), as the input in Step 2? Hope to be answered, thanks!
Hi @Iris7788 @Titantime314, the first step is to split the full file into individual sequence files; we then use these .seq files in the next step. (Because we use the BERT model to generate features from the .seq files one by one, this step is necessary.) Input.txt is an example of an output .seq file. In practice, you will replace it with your own file.
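In case it helps, here is a minimal sketch of what that splitting step does, assuming the full file has one sequence per line; the names are illustrative and the actual logic lives in extract_seq.py:

```python
# Minimal sketch of Step 1 (illustrative only; the actual logic is in
# extract_seq.py): split a file with one sequence per line into
# individual .seq files.
import os

def split_sequences(full_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(full_file) as f:
        for i, line in enumerate(f):
            seq = line.strip()
            if not seq:
                continue
            with open(os.path.join(out_dir, f"seq_{i}.seq"), "w") as out:
                out.write(seq + "\n")

split_sequences("enhancer.cv.unigram.txt", "seq_files")
```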
Thank you for your reply! So does that mean we need to use the thousands of .csv files obtained in Steps 2 and 3 to train the CNN model? I also have a question about the size of the training set (200x768): 200 is the length of the DNA sequence, but where did 768 come from? I can only generate .csv files of size 200x1024 with Steps 2 and 3. Thanks a lot for answering my questions!
Hi @Titantime314, it depends on the model you use; different models produce output vectors of different dimensions. Since I used the BERT-Base, Multilingual Cased model, the output has 768 dimensions. If you used BERT-Large, the dimension would be 1024. Best regards!
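For anyone checking their own output, here is a small sketch to inspect the vector dimension in a JSONL feature file. It assumes the layout produced by the Google BERT repo's extract_features.py (one JSON object per line, per-token "layers" with "values" vectors); "output.jsonl" is a placeholder for your own file:

```python
# Inspect the feature dimension of a JSONL file produced by BERT's
# extract_features.py. The JSON layout is assumed from the Google BERT
# repo; "output.jsonl" is a placeholder for your own file.
import json

with open("output.jsonl") as f:
    record = json.loads(f.readline())   # one JSON object per input sequence

vector = record["features"][0]["layers"][0]["values"]
print(len(vector))   # 768 for BERT-Base, 1024 for BERT-Large
```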
Meticulous reply, thank you very much!
Dear @khanhlee
I'm sorry, but there are some inconsistencies and ambiguities in both the source code and the README instructions.
Thus, I want to be sure that I understand the process correctly.
In Step 1, we generate thousands of .seq files, each consisting of a 200bp sequence.
In Step 2, we run the command in bert2json.txt for each of these .seq files, so we obtain thousands of feature files in JSONL format.
In Step 3, we convert all these JSONL files to CSV files.
In Step 4, we run 2D_CNN.py using these CSV files. (Here, you are reading the enhancer_test and non_cv files. I guess these consist of the enhancer and non-enhancer sequences generated from enhancer.cv.txt and non.cv.txt, respectively. And the train/test split placed just after the file reading is actually a train/validation split, right? See the sketch below.)
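To be concrete, this is the kind of split I mean; a minimal sketch with made-up shapes and names, not the actual code of 2D_CNN.py:

```python
# Sketch of the split I am asking about (shapes and names are made up,
# not taken from 2D_CNN.py). If the held-out part is only used to
# monitor training, it is a train/validation split, not a test split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 200, 768)      # stand-in for stacked CSV features
y = np.random.randint(0, 2, size=100)  # stand-in for enhancer labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```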
Thanks in advance for your reply :)
I have the same problem as smtnkc. And another question: if the single CSV file generated from Input.txt is fed into the CNN, that won't work, right? Because the amount of data is too small? Please answer, thank you! @smtnkc @khanhlee
Can I ask you some questions about the code framework? I'd like to know if there's anything wrong with my understanding of the code.
Step 1: You use extract_seq.py to process the DNA sequences, splitting them into .seq files, each of which contains one DNA sequence of length 200.
Step 2: You use the command in bert2json.txt to extract the features of each DNA sequence. This step does not retrain the BERT model; it directly uses the pre-trained BERT model published by Google to extract features, producing a JSON file for each sequence.
Step 3: Convert the JSON files obtained in the previous step to CSV files.
Step 4: Use the data from the CSV files as input to a CNN and train a 2D CNN model. (A sketch of how I picture this step is below.)
Above is my understanding of your code; if there is a mistake, please point it out. That will help me a lot! Thank you very much for your help.
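To check my understanding of Step 4, here is a minimal sketch of what I imagine the training looks like. The directory names, shapes, and architecture are my own assumptions for illustration, not the actual 2D_CNN.py:

```python
# Sketch of Step 4 as I understand it: stack the per-sequence CSV feature
# matrices (200 x 768 each) into one array and train a small 2D CNN.
# Paths, shapes, and architecture are assumptions for illustration only.
import glob
import numpy as np
from tensorflow.keras import layers, models

def load_features(pattern):
    # Each CSV is assumed to hold one 200 x 768 matrix for one sequence.
    files = sorted(glob.glob(pattern))
    return np.stack([np.loadtxt(f, delimiter=",") for f in files])

X_pos = load_features("enhancer_csv/*.csv")      # hypothetical directory
X_neg = load_features("non_enhancer_csv/*.csv")  # hypothetical directory
X = np.concatenate([X_pos, X_neg])[..., np.newaxis]  # (N, 200, 768, 1)
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(200, 768, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32)
```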