jvanheld / IBIS_2024

Participation in the IBIS benchmarking for motif discovery approaches
GNU General Public License v3.0

convert-seq -from fastq -to fasta #3

Closed jvanheld closed 3 months ago

jvanheld commented 4 months ago

The artificial high-throughput SELEX (HTS) files are provided in fastq format. We need to add an option to convert-seq so that it accepts fastq format as input.

brunocontrerasmoreira commented 4 months ago

Hi @jvanheld, commit https://github.com/rsa-tools/rsat-code/commit/87a220b2c40039d221d3ca412a0431dfb5fe9c41 adds FASTQ support to sub ReadNextSequence. A sample FASTQ file, test.fq, can be converted to FASTA as follows:

$ cat test.fq   
@m64268e_230602_135049/10/ccs np=8 rq=0.998228
ATGCTAAAGAAAAAGTAAAATAAAATTTAAGTAAACAAGTAAATAAAACACATGCATGCA
+
idzX}N~~c;t~~~t~I~~~~@~kfU~~U~}~S~syi~hYE~~p<~~|`hbtigD;f~\f 
@m64268e_230602_135049/13/ccs np=3 rq=0.992424
TAAATGTATTTCTCCTCTATCTATTGTGGATTGGGTTTCGAAGTGAGGATAAGCAGAGGA
+
O?c_QRYW<GUQ>B`%JWXVQWXNcOJCOVH[0B@%3AQLIX>RSXFeXXM_QRH5O8^F

$ convert-seq -i test.fq -from fastq -to fasta
>@m64268e_230602_135049/10/ccs np=8 rq=0.998228
ATGCTAAAGAAAAAGTAAAATAAAATTTAAGTAAACAAGTAAATAAAACACATGCATGCA
>@m64268e_230602_135049/13/ccs np=3 rq=0.992424
TAAATGTATTTCTCCTCTATCTATTGTGGATTGGGTTTCGAAGTGAGGATAAGCAGAGGA
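
For readers unfamiliar with the format, the conversion shown above can be sketched in a few lines of Python. This is only an illustration of the 4-line FASTQ record logic, not the Perl code of ReadNextSequence; the function names `read_fastq` and `fastq_to_fasta` are hypothetical, and the sketch assumes unwrapped records (exactly one sequence line and one quality line per read). Unlike the convert-seq output above, it drops the leading `@` from the header.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from 4-line FASTQ records.

    Records are read by position (header, sequence, '+', quality) rather
    than by looking for '@', because a quality line may itself start
    with '@'.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        handle.readline()  # quality line, not needed for FASTA
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        yield header[1:], sequence


def fastq_to_fasta(handle: TextIO) -> str:
    """Return the FASTA text for all records read from a FASTQ handle."""
    return "".join(f">{name}\n{seq}\n" for name, seq in read_fastq(handle))


if __name__ == "__main__":
    # Toy input; the first quality line deliberately starts with '@'.
    example = "@read1 np=8\nACGT\n+\n@III\n@read2\nTTGA\n+\n!!!!\n"
    print(fastq_to_fasta(io.StringIO(example)), end="")
```

For compressed input, the same functions should work on a handle opened with `gzip.open(path, "rt")` from the standard library.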

Please give it a try, Bruno

jvanheld commented 4 months ago

Great, you are faster than Batman! I will test it this evening and integrate it into a makefile.

brunocontrerasmoreira commented 4 months ago

So can you confirm this works as expected?

jvanheld commented 4 months ago

I have not had a chance to test it yet, but I intend to do it this evening. For the time being I have only treated two data types.

jvanheld commented 4 months ago

Hi @brunocontrerasmoreira

Reading fastq and fastq.gz works fine with convert-seq. However, I realized that peak-motifs had no -seq_format option. I added it and submitted the update to GitHub, but it will not be usable directly.

In the meantime, I can easily manage with the makefile by setting a condition. Indeed, for genomic data, I have to use fetch-sequences in order to get fasta sequences from peak coordinates. I will just add a conditional statement so that if the input format is fastq.gz I use convert-seq, and if I have bed files with peak coordinates I run fetch-sequences.

So in any case peak-motifs will take a fasta file as input.
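
The dispatch described above could look roughly like this. It is a Python sketch of the decision only, not the actual makefile rule; the function name `sequence_tool` and the list of file extensions are assumptions made for illustration. It returns only the name of the RSAT tool to run, since the exact fetch-sequences options are not given in this thread.

```python
from typing import Optional

# Hypothetical extension lists; adapt them to the actual file names.
FASTQ_EXTENSIONS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
BED_EXTENSIONS = (".bed", ".bed.gz")
FASTA_EXTENSIONS = (".fasta", ".fa", ".fasta.gz", ".fa.gz")


def sequence_tool(path: str) -> Optional[str]:
    """Return the tool needed to obtain fasta sequences from `path`.

    - fastq reads            -> convert-seq (format conversion)
    - bed peak coordinates   -> fetch-sequences (sequence retrieval)
    - fasta                  -> None, peak-motifs can read it as is
    """
    lower = path.lower()
    if lower.endswith(FASTQ_EXTENSIONS):
        return "convert-seq"
    if lower.endswith(BED_EXTENSIONS):
        return "fetch-sequences"
    if lower.endswith(FASTA_EXTENSIONS):
        return None
    raise ValueError(f"Unsupported input format: {path}")


if __name__ == "__main__":
    for name in ("reads.fastq.gz", "peaks.bed", "sequences.fasta"):
        print(name, "->", sequence_tool(name))
```

Whatever the branch, the downstream step stays the same: peak-motifs receives a fasta file.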

brunocontrerasmoreira commented 4 months ago

I can generate a new Docker container on Monday...

jvanheld commented 4 months ago

In the meantime I handled sequence conversion in the makefiles, which in the end is not so bad.

But it is definitely useful to be able to specify the sequence format in peak-motifs, to generalize its use.

jvanheld commented 3 months ago

Job done