raphaelmourad / DeepG4

DeepG4: A deep learning approach to predict active G-quadruplexes

about the test data #3

Open kailingli opened 1 year ago

kailingli commented 1 year ago

Hi,

In the "promoters_seq_example.bed" file, what are the 4th and 5th columns? I assume the 1st-3rd columns are "chr", "start", "end" and the 6th is "strand".

I was trying to use my own BED file with only "chr, start, end, strand" to do the prediction, but it failed. Can you help me with this? Thank you so much!!!

--KL

rochevin commented 1 year ago

Hi, the BED file was generated automatically by the export.bed function. The 4th column is the name (an ID) and the 5th is a score, which is set to zero here. See the UCSC format description: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Hope this helps! Best, Vincent
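For a file that only has "chr, start, end, strand", one option is to add the two missing columns before use. The sketch below uses base R only; the input values and the helper name `to_bed6` are hypothetical, and the thread does not confirm that the missing columns were what caused the failure.

```r
# Hypothetical 4-column input (chr, start, end, strand), as described in the question.
bed4 <- data.frame(
  chr    = c("chr1", "chr1", "chr2"),
  start  = c(1000L, 5000L, 250L),
  end    = c(1200L, 5150L, 450L),
  strand = c("+", "-", "+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the six BED columns in UCSC order
# (chrom, start, end, name, score, strand).
to_bed6 <- function(bed4) {
  n <- nrow(bed4)
  data.frame(
    chr    = bed4$chr,
    start  = bed4$start,
    end    = bed4$end,
    name   = sprintf("region_%d", seq_len(n)),  # arbitrary IDs
    score  = rep(0L, n),                        # placeholder score of zero
    strand = bed4$strand,
    stringsAsFactors = FALSE
  )
}

bed6 <- to_bed6(bed4)

# Write tab-separated text without header, row names, or quotes.
out_file <- tempfile(fileext = ".bed")
write.table(bed6, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The name column only needs to be an identifier, so any unique label per region would do in place of `region_1`, `region_2`, and so on.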

kailingli commented 1 year ago

Thank you for your reply. It seems that res <- DeepG4Scan(X = sequences,k=20,treshold=0.5) only works for sequences longer than 200bp. When I use a BED file containing sequences shorter than 200bp, it shows this error message:

> res <- DeepG4Scan(X = sequences,k=20,treshold=0.5)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 1, 0

Is there any way to use one set of code to predict sequences both shorter and longer than 200bp at the same time? Or do I just need to separate them and run them twice?

rochevin commented 1 year ago

It seems you are right: the code that splits a long sequence into smaller ones is probably responsible for this failure.
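To illustrate how such a failure can arise, here is a minimal sketch. It is not DeepG4's actual code: the helper `window_starts`, the 200bp window size, and the reading of k as the step between windows are all assumptions. A sequence shorter than one window yields zero windows, and combining one row of metadata with zero windows in a data.frame raises this kind of "differing number of rows" error.

```r
# Hypothetical sliding-window helper (NOT DeepG4's implementation):
# start positions of windows of length `size`, moved along by step `k`.
window_starts <- function(seq_length, size = 200L, k = 20L) {
  last_start <- seq_length - size + 1L
  if (last_start < 1L) {
    return(integer(0))  # sequence is shorter than a single window
  }
  seq.int(1L, last_start, by = k)
}

starts_long  <- window_starts(500L)  # several windows
starts_short <- window_starts(150L)  # no windows at all

# One sequence name combined with zero window starts makes data.frame() fail;
# the error message is captured rather than stopping the script.
msg <- tryCatch(
  {
    data.frame(seqname = "seq1", start = starts_short, check.names = FALSE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Under these assumptions, a length check before scanning would be enough to avoid the error.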

In my opinion, it is better to separate the two sets of sequences, because the results will not tell you the same thing. For sequences shorter than 200bp, you want to know whether or not the sequence may contain an active G4. For long sequences, you want to "locate" a potential active G4, or at least know whether there is one at some position, for instance when you scan a full promoter region.
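Separating the two sets can be done with a length check before prediction. The sketch below works on a plain named character vector; the example sequences, the helper name `split_by_length`, and the exact cutoff (strictly longer than 200bp counts as long) are assumptions rather than anything specified by the package.

```r
# Hypothetical example sequences; real ones would be extracted from the BED regions.
sequences <- c(
  short_a = strrep("G", 120),      # 120 bp
  long_a  = strrep("ACGT", 100),   # 400 bp
  short_b = strrep("GGGA", 40),    # 160 bp
  long_b  = strrep("C", 1000)      # 1000 bp
)

# Hypothetical helper: split sequences into those at or below the cutoff
# and those above it. The boundary case (exactly 200 bp) is an assumption.
split_by_length <- function(seqs, cutoff = 200L) {
  is_long <- nchar(seqs) > cutoff
  list(short = seqs[!is_long], long = seqs[is_long])
}

groups <- split_by_length(sequences)

# groups$long could then be scanned as in the thread, e.g.
#   res <- DeepG4Scan(X = groups$long, k = 20, treshold = 0.5)
# while groups$short would go through a separate prediction call.
print(lengths(groups))
```

Keeping the sequence names in both groups makes it possible to match the two sets of results back to the original regions afterwards.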