hyunhwan-jeong / SalmonTE

SalmonTE is an Ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

pair-end files #12

Closed wjyzidane closed 5 years ago

wjyzidane commented 6 years ago

I found that SalmonTE works well when I give it either the R1 FASTQ or the R2 FASTQ alone, but it gives the error below when I provide both:

image

I am using SalmonTE 0.3.

hyunhwan-jeong commented 6 years ago

There was an error when the file-name prefixes of the two paired-end FASTQ files were not the same (the part before _001 in this case). This has been resolved in commit 86b1e77d24caa5c59de4bee59f6625cdc0975f81. Please pull the latest version of SalmonTE.

Sorry for the inconvenience, and many thanks for reporting this!

Hyun-Hwan Jeong

wjyzidane commented 6 years ago

Got it. Thanks!

wjyzidane commented 6 years ago

Hi, I found that it still outputs nothing for the paired-end files:

image image

I am using SalmonTE 0.3

wjyzidane commented 6 years ago

Also, could you find a way to support names like *_r1.fq in addition to *_R1.fq? Thanks!

hyunhwan-jeong commented 6 years ago

@wjyzidane,

I tried to replicate the issue using paired-end FASTQ files with the same names, and it works in my case. I wonder whether you are using the latest version of SalmonTE. Could you let me know what you see when you execute the following command?

md5 SalmonTE.py

Here is my output; yours should be identical:

MD5 (SalmonTE.py) = a8d89b2822199b0cd4c599309631e1d6

If you do not see the identical MD5 checksum, please pull this git repository to your local copy.
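On systems without the `md5` command (most Linux distributions ship `md5sum` instead), the same checksum can be computed with Python's standard library. This is a minimal sketch; the helper name `file_md5` is only an illustration and is not part of SalmonTE.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5("SalmonTE.py")` from the directory that contains the script returns a hex string that can be compared with the checksum quoted in this comment.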

Furthermore, the *_r1.fq case was fixed in my last patch and should now be supported. SalmonTE.py is supposed to detect the end type of each FASTQ file automatically.
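To illustrate the kind of name-based detection discussed in this thread, the following sketch groups files into mate pairs by stripping an `_R1`/`_R2` tag (matched case-insensitively, with an optional `_001`-style suffix) and comparing the remaining prefix. It is written under assumed naming rules and is not the code in SalmonTE.py; `MATE_PATTERN` and `pair_fastq` are hypothetical names.

```python
import os
import re
from collections import defaultdict

# Assumed naming rule: <prefix>_R1 or <prefix>_R2 (any case), optionally
# followed by _<digits>, then .fq or .fastq with an optional .gz.
MATE_PATTERN = re.compile(
    r"^(?P<prefix>.+)_r(?P<mate>[12])(?:_\d+)?\.(?:fq|fastq)(?:\.gz)?$",
    re.IGNORECASE,
)


def pair_fastq(paths):
    """Split FASTQ paths into (paired, unpaired) using the shared name prefix."""
    groups = defaultdict(dict)
    unpaired = []
    for path in paths:
        match = MATE_PATTERN.match(os.path.basename(path))
        if match is None:
            # The name carries no recognizable mate tag.
            unpaired.append(path)
            continue
        key = os.path.join(os.path.dirname(path), match.group("prefix"))
        groups[key][match.group("mate")] = path
    paired = []
    for key in sorted(groups):
        mates = groups[key]
        if "1" in mates and "2" in mates:
            paired.append((mates["1"], mates["2"]))
        else:
            # Only one mate was found for this prefix.
            unpaired.extend(mates.values())
    return paired, unpaired
```

If `unpaired` is non-empty for a dataset that is expected to be fully paired-end, the file names probably do not follow the assumed pattern.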

Best Regards,

Hyun-Hwan

wjyzidane commented 6 years ago

It works after I pulled the newest version from the git repository! Thanks!

rrcutler commented 5 years ago

It seems that paired-end files may not be recognized when they are compressed?

SalmonTE.py --version
SalmonTE 0.4
SalmonTE.py quant --reference=mm --outpath=SalmonTE_output1 --num_threads=30 /home/UTHSCSA/cutlerr/Data/Kalamakis_2019_RNA-Seq/SRA/Bulk_RNA-Seq_Data/Trimmed_reads/Paired/temp1/SRR7290434_R1.fq.gz /home/UTHSCSA/cutlerr/Data/Kalamakis_2019_RNA-Seq/SRA/Bulk_RNA-Seq_Data/Trimmed_reads/Paired/temp1/SRR7290434_R2.fq.gz
2019-03-29 00:25:38,672 Starting quantification mode
2019-03-29 00:25:38,672 Collecting FASTQ files...
2019-03-29 00:25:38,673 The input dataset is considered as a single-end dataset.
2019-03-29 00:25:38,673 Collected 2 FASTQ files.
2019-03-29 00:25:38,674 Quantification has been finished.
2019-03-29 00:25:38,674 Running Salmon using Snakemake
Job counts:
        count   jobs
        1       all
        1       collect_abundance
        1       collect_mappability
        2       run_salmon_gz
        5
2019-03-29 00:25:38,771 Job counts:
        count   jobs
        1       all
        1       collect_abundance
        1       collect_mappability
        2       run_salmon_gz
        5

hyunhwan-jeong commented 5 years ago

@rrcutler Can you post the first few lines of each FASTQ file here?

Thank you,

Hyun-Hwan Jeong

rrcutler commented 5 years ago

head SRR7290434_R1.fq
@SRR7290434.1.1 HWI-ST1149:214:C4VMKACXX:6:1101:1355:2115 length=101
AAGCAGTGGTATCAACTCAGAGTACATGCGGAGACTTAGGACTTAGTCTCCCTTTCTCCCTAGGTGTAGAGGGTTCAGCCGTGTGCACCCCCCCCCTTCNN
+SRR7290434.1.1 HWI-ST1149:214:C4VMKACXX:6:1101:1355:2115 length=101
@?@?DF?DFCFHBBF@FFGGIIFHHCHB?FHGGDFHGHII?BGGEHBGIJIIJJGEHGGFHICH@EEGHHHFD?;2?@C98,9=BDCC?8=8<59><AA##
@SRR7290434.2.1 HWI-ST1149:214:C4VMKACXX:6:1101:1637:2151 length=101
AAGCAGTGGTATCAACGCATAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGTGTTTTTTTTTTTTATATTAAAATATAAAAAAAAATTTT
+SRR7290434.2.1 HWI-ST1149:214:C4VMKACXX:6:1101:1637:2151 length=101
CCCFFFFFGHHHHJJJJIHHIIHIJJJJJJJJJJJJHFDDDDDDDDDDDDDDDDDD@5&))&+(+8398&))5>5&((((((+(((+((((((&&&&++4+
@SRR7290434.3.1 HWI-ST1149:214:C4VMKACXX:6:1101:1563:2199 length=101
TACATATTGGCTTCTCCAGAAAATACACGTTTAAACAAGCCATGCACCCATCTCATTTCATTTAATTTTCTGGTCTCTCAGTCTCATCACCTTGACTAGG
head SRR7290434_R2.fq
@SRR7290434.1.2 HWI-ST1149:214:C4VMKACXX:6:1101:1355:2115 length=101
ATAGCAAAGTTAAAATAAATACTAATAACCTCTGTAACAACAGGGAAATCTAGTTCAGTAGCAGCACCTGAAAGGCAGACAGGCAGTCTCGTCAACACANN
+SRR7290434.1.2 HWI-ST1149:214:C4VMKACXX:6:1101:1355:2115 length=101
CCCFFFFFHHHDBGBEHGGGHIIFHGCGEGHEHHFHGHIJJGGIIIGG@HGHHHGIIHIIEGIHBHIIJIJJJJIIBCEHFDF=AA@ACCC;?B:>><(##
@SRR7290434.2.2 HWI-ST1149:214:C4VMKACXX:6:1101:1637:2151 length=101
TGACATTGTAACTATGAATTCATGTTTTAGAATTGTGTGTGCTCCCATGTAAGGAAACCACTTGTTAGTAAAGAAATCCATGGATTATATGTAAAAGAATT
+SRR7290434.2.2 HWI-ST1149:214:C4VMKACXX:6:1101:1637:2151 length=101
@C@FFFFFHHHHGIIJHBJHHIJIIIJJFIJIJIJEHGGFHIJJJIJJJIFIHIJIIJJJIJJJJJJJHIIJJJJJJJJJHHHHEDFFFFFFEADEEDDD>
@SRR7290434.3.2 HWI-ST1149:214:C4VMKACXX:6:1101:1563:2199 length=101
CCACCACCAAAAAAAAAAAAAAAAAATTGATAGGGGATTTTAGGATTTTGAGCCATAGCTAGCCAATATGTTACACATTGTTTTATACAATTTCCTGCTGC
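
In the listings above, the first token of each header ends in `.1` or `.2` (`SRR7290434.1.1` versus `SRR7290434.1.2`), and the rest of the read name matches between the two files. A quick sanity check that two files really are mates is to compare read names with that tag removed. This is a minimal sketch that assumes this SRA-style header layout and standard 4-line FASTQ records; `open_fastq`, `read_key` and `looks_paired` are hypothetical helper names, not SalmonTE functions.

```python
import gzip
import itertools
import re


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_key(header):
    """Return the read name without the leading '@' and a trailing .1/.2 or /1,/2 tag."""
    name = header.strip().split()[0].lstrip("@")
    return re.sub(r"[./][12]$", "", name)


def looks_paired(path_r1, path_r2, n_reads=1000):
    """Return True if the first n_reads read names agree between the two files.

    Only as many reads as the shorter file provides are compared, so two
    empty files trivially return True.
    """
    with open_fastq(path_r1) as r1, open_fastq(path_r2) as r2:
        # Header lines are every 4th line of a 4-line FASTQ record.
        heads1 = itertools.islice(r1, 0, n_reads * 4, 4)
        heads2 = itertools.islice(r2, 0, n_reads * 4, 4)
        return all(read_key(a) == read_key(b) for a, b in zip(heads1, heads2))
```

A `True` result only says that the read names line up; it says nothing about how SalmonTE itself classifies the files.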

Furthermore, I get the same problem when running with the uncompressed files:

SalmonTE.py quant --reference=mm --outpath=SalmonTE_output1 --num_threads=30 /home/UTHSCSA/cutlerr/Data/Kalamakis_2019_RNA-Seq/SRA/Bulk_RNA-Seq_Data/Trimmed_reads/Paired/temp1/SRR7290434_R1.fq /home/UTHSCSA/cutlerr/Data/Kalamakis_2019_RNA-Seq/SRA/Bulk_RNA-Seq_Data/Trimmed_reads/Paired/temp1/SRR7290434_R2.fq
2019-03-29 00:34:25,629 Starting quantification mode
2019-03-29 00:34:25,629 Collecting FASTQ files...
2019-03-29 00:34:25,629 The input dataset is considered as a single-end dataset.
2019-03-29 00:34:25,630 Collected 2 FASTQ files.
2019-03-29 00:34:25,630 Quantification has been finished.
2019-03-29 00:34:25,630 Running Salmon using Snakemake
Job counts:
        count   jobs
        1       all
        1       collect_abundance
        1       collect_mappability
        2       run_salmon_fq
        5
2019-03-29 00:34:25,726 Job counts:
        count   jobs
        1       all
        1       collect_abundance
        1       collect_mappability
        2       run_salmon_fq
        5

hyunhwan-jeong commented 5 years ago

@rrcutler I have fixed the problem and created a branch for testing your case. Can you please clone the branch and check whether my fix works for you?

git clone -b paired-end https://github.com/LiuzLab/SalmonTE/

Thank you,

Hyun-Hwan Jeong

rrcutler commented 5 years ago

Things are working great now - Thanks!

annsophiegironne commented 4 months ago

Hi!

I had a problem with my paired-end data files: only half of them would load. I saw in another issue that the NCBI FASTQ format is sometimes a problem, so I modified my FASTQ files to fit the original format, and now all 16 files (8 paired-end samples) load. However, SalmonTE reports that they are loaded as single-end files...

Here are the first few lines of one sample's fastq files:

head /u/gironnea/polyA/scratch/fastq/colon/SRR6410603/salmonTE/SRR6410603_R1.fastq
@SRR6410603.62.1 NS500482:96:HT5M5BGXX:1:11101:20908:1061
ATTCTNCCCCAGCCCAGGCTGGGGTACCCAGAGACCTGGGAAATNNNGNNGNGTCA
+SRR6410603.62.1 NS500482:96:HT5M5BGXX:1:11101:20908:1061.1 length=64
AAAAA#EEEEEEEEEAEEEEEEEEEEEEEEE6E6EE/EEEE6EE###E##E#EEEE
@SRR6410603.63.1 NS500482:96:HT5M5BGXX:1:11101:20617:1062
CTTGTNTTTAGCAGCATTCACCCGTGTCTGTTCACTGACCAAAGNNNANNATTTGTNNNGNNNNNNNNNNNNNC
+SRR6410603.63.1 NS500482:96:HT5M5BGXX:1:11101:20617:1062.1 length=74
AAAAA#EEEEEEEEAEEAEEEEAAEEEEEEEEEEEAEAEEEEEE###A##EE<6A<###E#############A
@SRR6410603.64.1 NS500482:96:HT5M5BGXX:1:11101:15920:1062
CCAGGTTGGAACTTGCAATAACCATCCTTGCCCTGGTAGGGGTANNNGNNTTCACC

head /u/gironnea/polyA/scratch/fastq/colon/SRR6410603/salmonTE/SRR6410603_R2.fastq
@SRR6410603.62.2 NS500482:96:HT5M5BGXX:1:11101:20908:1061
GGTACATACTCATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNC
+SRR6410603.62.2 NS500482:96:HT5M5BGXX:1:11101:20908:1061.2 length=76
AAA6A6E/EEE/AE####################################E#E
@SRR6410603.63.2 NS500482:96:HT5M5BGXX:1:11101:20617:1062
CAAATACCACCCAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTTNA
+SRR6410603.63.2 NS500482:96:HT5M5BGXX:1:11101:20617:1062.2 length=53
AAAAAEEE/EAEEEE#################################EEE#E
@SRR6410603.64.2 NS500482:96:HT5M5BGXX:1:11101:15920:1062
GGCAGTTGCTGGACTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCNTCG

Here is my command:

SalmonTE.py quant --reference=hs_grch38 --outpath=$output --num_threads=2 --exprtype=count $path/*/salmonTE/*_R*.fastq

I also tried specifying:

SalmonTE.py quant --reference=hs_grch38 --outpath=$output --num_threads=2 --exprtype=count $path/*/salmonTE/*_R1.fastq $path/*/salmonTE/*_R2.fastq

but I get the same result:

2024-02-22 12:15:54,346 Starting quantification mode
2024-02-22 12:15:54,346 Collecting FASTQ files...
2024-02-22 12:15:54,361 The input dataset is considered as a single-end dataset.
2024-02-22 12:15:54,361 Collected 16 FASTQ files.
2024-02-22 12:15:54,362 Quantification has been finished.
2024-02-22 12:15:54,362 Running Salmon using Snakemake
2024-02-22 12:15:55,106 Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-22 12:15:55,106 NumExpr defaulting to 8 threads.
Building DAG of jobs...
2024-02-22 12:15:55,425 Building DAG of jobs...

I saw that for some people, it worked when specifying the files for one sample at a time, but I still get the same result: "The input dataset is considered as a single-end dataset.".
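
For reference, the one-sample-at-a-time invocation mentioned here can be scripted. The sketch below only builds command lines from the options already used in this comment (`--reference`, `--outpath`, `--num_threads`, `--exprtype`) and assumes the `<path>/<sample>/salmonTE/<files>` layout of the command above; `per_sample_commands` is a hypothetical helper, and it does not change how SalmonTE decides between single-end and paired-end input.

```python
import os
import shlex
from collections import defaultdict


def per_sample_commands(fastq_paths, reference, outdir, threads=2):
    """Build one `SalmonTE.py quant` command line per sample directory."""
    by_sample = defaultdict(list)
    for path in fastq_paths:
        # Assumed layout: <path>/<sample>/salmonTE/<file>, so the sample name
        # is the directory two levels above the file.
        sample = os.path.basename(os.path.dirname(os.path.dirname(path)))
        by_sample[sample].append(path)
    commands = []
    for sample in sorted(by_sample):
        args = [
            "SalmonTE.py",
            "quant",
            f"--reference={reference}",
            f"--outpath={os.path.join(outdir, sample)}",
            f"--num_threads={threads}",
            "--exprtype=count",
        ] + sorted(by_sample[sample])  # R1 sorts before R2 within a sample
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands
```

Each returned string can be printed and inspected before running it, which makes it easy to confirm that every sample contributes exactly two files in R1, R2 order.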

Do you have any idea how I could solve this? Otherwise, could I just sum the quantification for each sample?

Thank you so much!