hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

Salmon TE does not load all the fastq #28

Closed SalimMegat closed 3 years ago

SalimMegat commented 5 years ago

Hi, I have been trying to run SalmonTE on a Linux server without success. The installation seems fine, since I can call SalmonTE.py without errors. However, when I run "SalmonTE.py quant", it loads only 1 FASTQ file from the folder I point it at (the command is below), although the folder actually contains 36 FASTQ files. Is it a problem with the file names (listed below)? Finally, these FASTQ files are paired-end, as you can see, yet SalmonTE treats them as single-end.

Thanks for your help !

Salim.

(snakemake) [smegat@hpc-login1 ~]$ SalmonTE.py quant --reference=mm --outpath=Salmon_out_quant --num_threads=12 ~/fusDNLS_fastq
2019-02-19 09:03:19,755 Starting quantification mode
2019-02-19 09:03:19,755 Collecting FASTQ files...
2019-02-19 09:03:19,756 The input dataset is considered as a single-end dataset.
2019-02-19 09:03:19,757 Collected 1 FASTQ files.
2019-02-19 09:03:19,757 Quantification has been finished.
2019-02-19 09:03:19,757 Running Salmon using Snakemake
Job counts:
    count   jobs
    1   all
    1   collect_abundance
    1   collect_mappability
    1   run_salmon_fq
    4
2019-02-19 09:03:20,969 Job counts:
    count   jobs
    1   all
    1   collect_abundance
    1   collect_mappability
    1   run_salmon_fq
    4

fusDNLS_fastq]$ ls
SRR6924174_1.fastq.gz SRR6924176_2.fastq.gz SRR6924179_1.fastq.gz SRR6924181_2.fastq.gz SRR6924194_1.fastq.gz SRR6924196_2.fastq.gz SRR6924199_1.fastq.gz SRR6924201_2.fastq.gz
SRR6924174_2.fastq.gz SRR6924177_1.fastq.gz SRR6924179_2.fastq.gz SRR6924192_1.fastq.gz SRR6924194_2.fastq.gz SRR6924197_1.fastq.gz SRR6924199_2.fastq.gz Salmon_out_quant
SRR6924175_1.fastq.gz SRR6924177_2.fastq.gz SRR6924180_1.fastq.gz SRR6924192_2.fastq.gz SRR6924195_1.fastq.gz SRR6924197_2.fastq.gz SRR6924200_1.fastq.gz
SRR6924175_2.fastq.gz SRR6924178_1.fastq.gz SRR6924180_2.fastq.gz SRR6924193_1.fastq.gz SRR6924195_2.fastq.gz SRR6924198_1.fastq.gz SRR6924200_2.fastq.gz
SRR6924176_1.fastq.gz SRR6924178_2.fastq.gz SRR6924181_1.fastq.gz SRR6924193_2.fastq.gz SRR6924196_1.fastq.gz SRR6924198_2.fastq.gz SRR6924201_1.fastq.gz

hyunhwan-jeong commented 5 years ago

Hi @SalimMegat,

I am wondering if you see the same error with the following command:

SalmonTE.py quant --reference=mm --outpath=Salmon_out_quant --num_threads=12 ~/fusDNLS_fastq/*.fastq.gz

Hyun-Hwan Jeong

SalimMegat commented 5 years ago

It seems really random. I reran the command

SalmonTE.py quant --reference=mm --outpath=Salmon_out_quant --num_threads=16 ~/fusDNLS_fastq/

and now it collects 36 FASTQ files, but it still handles them as single-end even though they are paired.

hyunhwan-jeong commented 5 years ago

@SalimMegat I am sorry to hear you are having trouble.

However, I need to figure out what the problem is, so I would like to ask you to share the output of the following commands:

zcat SRR6924174_1.fastq.gz | head -8
zcat SRR6924174_2.fastq.gz | head -8

If zcat doesn't work, you may need to use gzcat instead.
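Where neither zcat nor gzcat is available, the same check can be done with Python's standard library. The following is a minimal sketch; the function name `head_lines` is made up for illustration and is not part of SalmonTE.

```python
import gzip
from itertools import islice


def head_lines(path, n_lines=8):
    """Return the first n_lines lines of a gzip-compressed text file.

    Rough equivalent of `zcat FILE | head -8`; the file is streamed,
    so only the beginning is decompressed.
    """
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

For example, `print("\n".join(head_lines("SRR6924174_1.fastq.gz")))` would print the first two FASTQ records of that file.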

Thank you,

Hyun-Hwan Jeong

SalimMegat commented 5 years ago

Hi,

(snakemake) [smegat@hpc-login1 fusDNLS_fastq]$ zcat SRR6924174_1.fastq.gz | head -8
@SRR6924174.1.1 HWI-D00436:365:CAVYPANXX:2:1101:1477:1848 length=101
NTGGGATTAAAGGTGTTTTTTTAGTTTTCAAGACAGCATTTCTCTGTTCCTGGCTGTCCTGGAACTTGATCTGTAGACAAGGCTGGCCTCAAATCAGAGAA
+SRR6924174.1.1 HWI-D00436:365:CAVYPANXX:2:1101:1477:1848 length=101
<<ABGGGGGGGGFGGGGGGGGFGGEGGGGGGGGGGGFGGGGGEGGGGGGGGGFEFGDBGGFBEGECGEGGGGGGGGGGGGGGGGBDGGGFCGGGGGGE00
@SRR6924174.2.1 HWI-D00436:365:CAVYPANXX:2:1101:1451:1925 length=101
TCTCCATCCAGGTGGTGCTTTCGGGCAAGGTAGCGCAGGATGGCATTGCTCTGGGTGATCTTGTGTGATCCATCGATCAAGTAAGGCAGATTGGGAAAGTC
+SRR6924174.2.1 HWI-D00436:365:CAVYPANXX:2:1101:1451:1925 length=101
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Thanks !

Salim.

hyunhwan-jeong commented 5 years ago

@SalimMegat, my guess is that the problem occurs when SalmonTE parses the FASTQ files. SalmonTE works fine if your FASTQ files contain Illumina sequence identifiers, but it is not guaranteed to work with NCBI sequence identifiers. I will take a look as soon as possible and let you know when I have solved the problem. Thanks for reporting this!
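To see which kind of identifier a file carries before running the pipeline, a small classifier along these lines may help. It is a rough sketch and not SalmonTE's own parser: the function name `classify_header`, the category labels, and the three regular expressions are assumptions that only approximate the identifier styles quoted in this thread.

```python
import re

# Approximate patterns (assumptions for illustration, not SalmonTE's parser).
# NCBI SRA style, e.g. "@SRR6924174.1.1 HWI-D00436:... length=101"
SRA_RE = re.compile(r"^@[SED]RR\d+\.\d+")
# Illumina CASAVA 1.8+ style, e.g. "@MG00HS15:1192:...:1999 1:N:0:ACTGAT"
CASAVA_18_RE = re.compile(r"^@[^\s:]+(?::[^\s:]+){6}\s+[12]:[YN]:\d+:\S*$")
# Identifier ending in a /1 or /2 mate suffix, e.g. "@...:2172#34637_.../1"
SLASH_MATE_RE = re.compile(r"^@[^\s:]+(?::[^\s:#]+)+(?:#\S*)?/[12]$")


def classify_header(header):
    """Return a coarse label for the first line of a FASTQ record."""
    header = header.strip()
    if SRA_RE.match(header):
        return "ncbi-sra"
    if CASAVA_18_RE.match(header):
        return "illumina-casava-1.8"
    if SLASH_MATE_RE.match(header):
        return "slash-mate-suffix"
    return "unknown"
```

A header that comes back as "ncbi-sra" or "unknown" would be the first thing to inspect when pairs are not detected.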

Hyun-Hwan Jeong

hyunhwan-jeong commented 5 years ago

@SalimMegat, while I am working on the problem, I believe the following is a quick-and-dirty workaround for your case. Can you re-download a couple of samples (3-4) using fastq-dump with the --origfmt parameter and run SalmonTE again? Doing this and reporting the result would help me, but it is up to you.

Hyun-Hwan Jeong

Papareddy commented 5 years ago

Hi,

Quantification is ultra fast, as the title promises.

I am running SalmonTE on my local computer and gave it the full path, but I have run into the same problem: my paired-end data cannot be read.

nbm-gmi-39:SalmonTE ranjith.papareddy$ SalmonTE.py quant --reference=Ath --num_threads=12 BC.Col.1_1.fastq BC.Col.1_2.fastq BC.nrpd3a.1_1.fastq BC.nrpd3a.1_2.fastq
2019-03-25 21:14:57,635 Starting quantification mode
2019-03-25 21:14:57,635 Collecting FASTQ files...
2019-03-25 21:14:57,639 The input dataset is considered as a paired-ends dataset.
Traceback (most recent call last):
  File "/Users/ranjith.papareddy/SalmonTE/SalmonTE.py", line 286, in <module>
    run(args)
  File "/Users/ranjith.papareddy/SalmonTE/SalmonTE.py", line 232, in run
    param = {**args, **collect_FASTQ_files(args['FILE'])}
  File "/Users/ranjith.papareddy/SalmonTE/SalmonTE.py", line 119, in collect_FASTQ_files
    os.symlink(os.path.abspath(a), os.path.join(tmp_dir, trim_a))
FileExistsError: [Errno 17] File exists: '/Users/ranjith.papareddy/SalmonTE/BC.nrpd3a.1_1.fastq' -> '/var/folders/kz/p5lq4zg5633fkrv98d997x7w000_kg/T/tmpubn_t822/_R1.fastq'

But when I input only mate 1, it does the job:

nbm-gmi-39:SalmonTE ranjith.papareddy$ SalmonTE.py quant --reference=Ath --num_threads=12 Col.1_R1.fastq nrpd3a.1_R1.fastq
2019-03-25 21:18:10,227 Starting quantification mode
2019-03-25 21:18:10,227 Collecting FASTQ files...
2019-03-25 21:18:10,229 The input dataset is considered as a single-end dataset.
2019-03-25 21:18:10,230 Collected 2 FASTQ files.
2019-03-25 21:18:10,230 Quantification has been finished.
2019-03-25 21:18:10,230 Running Salmon using Snakemake
Job counts:
    count   jobs
    1   all
    1   collect_abundance
    1   collect_mappability
    2   run_salmon_fq
    5
2019-03-25 21:18:10,417 Job counts:
    count   jobs
    1   all
    1   collect_abundance
    1   collect_mappability
    2   run_salmon_fq
    5
Job counts:
    count   jobs
    1   collect_abundance
    1
Job counts:
    count   jobs
    1   collect_mappability
    1
nbm-gmi-39:SalmonTE ranjith.papareddy$

Also, my samples are not SRR downloads.

Example headers for mates 1 and 2:

nbm-gmi-39:SalmonTE ranjith.papareddy$ head -n 8 Col.1_R1.fastq
@7001253F:342:HF5CCBCXX:1:1105:1208:2172#34637_TAAGGCGATATCCTCT/1
GCGTAATGTTGCTTGCTTCCGCGGTTTCCATGTTCTGCTTAGGCTGGGTC
+
BD@DBD1GHEEHHHGIIIIG@EHGCHHHH@CFGCG@EHHIIHEHFEHFGE
@7001253F:342:HF5CCBCXX:1:1105:1168:2179#34637_TAAGGCGATATCCTCT/1
GCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAAAGAAAAT
+
DDDDBEGHHIIIIHIIIIHHHHIIIIIIIIIIIIIIHID/1<<1111<11
nbm-gmi-39:SalmonTE ranjith.papareddy$ head -n 8 Col.1_R2.fastq
@7001253F:342:HF5CCBCXX:1:1105:1208:2172#34637_TAAGGCGATATCCTCT/2
GATCAGACCCAGCCTAAGCAGAACATGGAAACCGCGGAAGCAAGCAACAT
+
B@D?DIIIIIIHIIGHEHGEHFIHIHHIIIIIIIIIIIHHHHIHIIIIIH
@7001253F:342:HF5CCBCXX:1:1105:1168:2179#34637_TAAGGCGATATCCTCT/2
TGCTAAACCAATAGGATATCANCNATTTCTCACAATTATCTTTCAANANN
+
<0<0DHH@111111111<1<C#<#<<<<C?CFHHEEECHHFEHHIE#<##

Many thanks, Ranj

hyunhwan-jeong commented 5 years ago

@Papareddy, I am sorry to hear you are having trouble. Can you replace all dots that precede the extension with hyphens (-), e.g. rename BC.Col.1_1.fastq to BC-Col-1_1.fastq?
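For more than a handful of files, the suggested renaming can be scripted. This is a minimal sketch under stated assumptions: the helper names `dots_to_hyphens` and `rename_fastq_files` and the list of recognized extensions are made up for illustration, and the function defaults to a dry run so that nothing is renamed until the plan has been checked.

```python
from pathlib import Path

# Extensions treated as "the extension"; an assumption for this sketch.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def dots_to_hyphens(filename):
    """Replace every dot before the FASTQ extension with a hyphen."""
    for suffix in FASTQ_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            return stem.replace(".", "-") + suffix
    return filename  # not a FASTQ file name: leave untouched


def rename_fastq_files(directory, dry_run=True):
    """Return the (old, new) rename plan; rename on disk only if dry_run is False."""
    plan = []
    for path in sorted(Path(directory).iterdir()):
        new_name = dots_to_hyphens(path.name)
        if path.is_file() and new_name != path.name:
            plan.append((path.name, new_name))
            if not dry_run:
                path.rename(path.with_name(new_name))
    return plan
```

With the file names from the report above, `dots_to_hyphens("BC.Col.1_1.fastq")` gives `"BC-Col-1_1.fastq"`, which matches the rename suggested here.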

Thank you,

Hyun-Hwan Jeong

Papareddy commented 5 years ago

Hyun-Hwan Jeong, thanks for the quick response. I really appreciate it. It did work after the suggested changes (aha).

Ranj

hyunhwan-jeong commented 5 years ago

So does it work after you changed the file names?

savytskanatalia commented 4 years ago

Good afternoon, I seem to have the same (or a very similar) problem. I have 12 files from 6 paired-end samples; pointing SalmonTE at the directory that contains all the files results in it identifying a single single-end .fq file. [image]

File names in the 'fastq' subdirectory are: sample01_R1.fq.gz sample01_R2.fq.gz sample02_R1.fq.gz sample02_R2.fq.gz sample03_R1.fq.gz sample03_R2.fq.gz sample04_R1.fq.gz sample04_R2.fq.gz sample05_R1.fq.gz sample05_R2.fq.gz sample06_R1.fq.gz sample06_R2.fq.gz

When I attempt to simply specify all files, I get an error "One of input files can be paired to multiple files" Command: SalmonTE.py quant --reference=mm --num_threads=6 --outpath=salmonte/ --exprtype=count fastq/sample01_R1.fq.gz fastq/sample01_R2.fq.gz fastq/sample02_R1.fq.gz fastq/sample02_R2.fq.gz fastq/sample03_R1.fq.gz fastq/sample03_R2.fq.gz fastq/sample04_R1.fq.gz fastq/sample04_R2.fq.gz fastq/sample05_R1.fq.gz fastq/sample05_R2.fq.gz fastq/sample06_R1.fq.gz fastq/sample06_R2.fq.gz

When I tried running with the "*.fq.gz" pattern, I got the same message, "One of input files can be paired to multiple files". Command: `SalmonTE.py quant --reference=mm --num_threads=6 --outpath=salmonte/ --exprtype=count fastq/*.fq.gz`

When I simply run single sample with command, everything works perfectly. Command: SalmonTE.py quant --reference=mm --num_threads=6 --outpath=salmonte/ --exprtype=count fastq/sample01_R1.fq.gz fastq/sample01_R2.fq.gz

I tried changing the names of the files ("sample_0_1.fastq.gz", "sample_0_R1.fastq.gz", "sample0*_R1.fastq.gz"); it didn't help.

I also tried changing the number of threads, since @SalimMegat mentioned that this led to all 36 of his files being collected. Changing the number of threads had no effect on the outcome of my runs...

Thank you for your help!

Natalia.

hyunhwan-jeong commented 4 years ago

Can you share the files? I guess there are some problems with your fastq file formatting. You don't have to share the entire files, and the first few lines of each file should be fine.

Thank you,

Hyun-Hwan Jeong

savytskanatalia commented 4 years ago

Here is the header for one of the files (the rest have the same formatting): [image]

Thank you,

Natalia.

hyunhwan-jeong commented 4 years ago

@savytskanatalia, SalmonTE automatically detects FASTQ pairs from the first line of each FASTQ file, and it assumes that this first line (the read identifier in the header) is correctly formed. I believe SalmonTE handles FASTQ files whose identifiers follow one of the formats described on the Wikipedia page, so I recommend that you follow one of those formats. Let me know if you need any help.
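As a quick sanity check before running the pipeline, the first identifiers of two candidate mate files can be compared by hand. The sketch below is not SalmonTE's detection code; `read_key` and `same_fragment` are hypothetical helpers that only cover two common conventions: a mate number in the comment after the first whitespace, or a trailing /1 or /2 suffix.

```python
import re


def read_key(header):
    """Reduce a FASTQ identifier line to a mate-independent key."""
    name = header.strip().split()[0]    # drop the comment after the first whitespace
    return re.sub(r"/[12]$", "", name)  # drop a trailing /1 or /2 mate suffix


def same_fragment(header_1, header_2):
    """True if two identifier lines appear to describe mates of one fragment."""
    return read_key(header_1) == read_key(header_2)
```

If `same_fragment` returns False for the first records of an R1/R2 pair, the identifiers are a likely reason for the pair not being recognized.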

Best,

Hyun-Hwan Jeong

hyunhwan-jeong commented 4 years ago

@savytskanatalia do you still have the problem?

Hyun-Hwan Jeong

savytskanatalia commented 4 years ago

@hyunhwaj I edited the read names in the FASTQ files, and the quantification step worked like a charm! Thank you very much for your help!

Best, Natalia.

brunmer4 commented 4 years ago

Hi @hyunhwaj,

it seems I ran into a similar issue as @savytskanatalia.

Here is the header for one of my files:

@MG00HS15:1192:HKGTHBCXY:1:1101:2123:1999 1:N:0:ACTGAT
NTGAGTTTAGCAAGGTGGCAGGATACAGGATCAACATACAAACATCAATTGCATTTCTATATTCTAGCAATAAACATGTGGAAATTGAAATTAAAAATTCA
+
#<DDDHIIIIIIIIIIIIIIIIHIIIIIHHIIIIIIIHIIIIIIIIIIHHIIIIIIIIIIIIIIIGHIIIIIIHIIHIIIGIIIIIHIIHIIIHIHHIIHH
@MG00HS15:1192:HKGTHBCXY:1:1101:2306:1997 1:N:0:ACTGAT
NTCATAAATAATCCATGAAATAAACTTTACTTCTGGCTCCTCTGAATAGTATCATGTTATTCCCGGAACACAAGGTATTAGTCAAGTGGCTGCTGTGAAGT
+
#<<BDCGHIH<DGHE@FFGHIIIHIIHHHHC<GHICHHGHHHHE<@FHHHHH?HHIIHFHECGHFHHCHGIHEHGCF@HCHIEEEHIIHICC1<FG?1DF<
@MG00HS15:1192:HKGTHBCXY:1:1101:3320:2000 1:N:0:ACTGAT
NGGATCAAAGGGTTATAAAAGTCTGTGACAATCTGATGGCCATACCAGGAGCAAGCTACCAAGGCGGCAAGACCTGCCACGATGAAAATTATGCCTCCA

According to Wiki the header corresponds to the Illumina 1.8 format. How should I change it to make it work? @savytskanatalia do you mind sharing how you solved the issue?

Thanks, -Reinhard

hyunhwan-jeong commented 4 years ago

@brunmer4 Can you share more details about your FASTQ files? Are they paired-end? If so, can you share the first few lines of the other file of the pair for this sample?

Thank you!

Hyun-Hwan Jeong

brunmer4 commented 4 years ago

Hi @hyunhwaj,

the files are indeed paired end. Here are the first lines of one of the pairs (there were 5 pairs in the folder):

#file1: negcon1a_1.fq

@MG00HS15:1192:HKGTHBCXY:1:1101:2123:1999 1:N:0:ACTGAT
NTGAGTTTAGCAAGGTGGCAGGATACAGGATCAACATACAAACATCAATTGCATTTCTATATTCTAGCAATAAACATGTGGAAATTGAAATTAAAAATTCA
+
#<DDDHIIIIIIIIIIIIIIIIHIIIIIHHIIIIIIIHIIIIIIIIIIHHIIIIIIIIIIIIIIIGHIIIIIIHIIHIIIGIIIIIHIIHIIIHIHHIIHH
@MG00HS15:1192:HKGTHBCXY:1:1101:2306:1997 1:N:0:ACTGAT
NTCATAAATAATCCATGAAATAAACTTTACTTCTGGCTCCTCTGAATAGTATCATGTTATTCCCGGAACACAAGGTATTAGTCAAGTGGCTGCTGTGAAGT
+
#<<BDCGHIH<DGHE@FFGHIIIHIIHHHHC<GHICHHGHHHHE<@FHHHHH?HHIIHFHECGHFHHCHGIHEHGCF@HCHIEEEHIIHICC1<FG?1DF<
@MG00HS15:1192:HKGTHBCXY:1:1101:3320:2000 1:N:0:ACTGAT
NGGATCAAAGGGTTATAAAAGTCTGTGACAATCTGATGGCCATACCAGGAGCAAGCTACCAAGGCGGCAAGACCTGCCACGATGAAAATTATGCCTCCACC
+
#<DDDIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIHIHIIIIIIIIIIIHIIIIIIIIHIIIIII

#file2: negcon1a_2.fq

@MG00HS15:1192:HKGTHBCXY:1:1101:2123:1999 2:N:0:ACTGAT
ATGTTATTAGATTCACATCTGAGTATTTCACCTTTCTTAAGTAATTGTAAGTGGTACTGAATTTTTAATTTCAATTTCCACATGTTTATTGCTAGAATATA
+
DDBDDIIIHHHHHHHHHIIIIIIIEHIIIIIIIHGIIIIIIIFHIICGHHHHIHHHHIIIIIIIHIEFHHHCHEHIHHHIHCHHHHIEHHHHHHIHHIGHH
@MG00HS15:1192:HKGTHBCXY:1:1101:2306:1997 2:N:0:ACTGAT
CAGCAACTGCCTGGATGTATTTGATTTTTTAAAGCGTAGACATATATTTATGAATGTGCATTTCTTGACTTCACAGCAGCCACTTGACTAATACCTTGTGT
+
DDDBDHGIIIHEHHIHIIIIIIIG?FH@FHFHEHHHHCCHHHIEHHGHH?GCGHII<GHHEGH@FGHHFEEGHEHHHIIHHHHHEHH@GHEHEHEHI1FGF
@MG00HS15:1192:HKGTHBCXY:1:1101:3320:2000 2:N:0:ACTGAT
GGATGGACTGCGTCACGCAGAGCACGGGGATGATGAGCTGCAAAATGTACGACTCGGTGCTCGCCCTGTCCGCGGCCTTGCAGGCCACTCGAGCCCTAATG
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIHHHIHHHIHIIIIIIIHIIIIIIIIIIIEHGHIHHHIII@CHFIHIFH
@MG00HS15:1192:HKGTHBCXY:1:1101:4062:1997 2:N:0:ACTGAT
CAAGCCTCAGAGTACTTCGAGTCTCCCTTCACCATTTCCGACGGCATCTACGGCTCAACATTTTTTGTAGCCACAGGCTTCCACGGACTTCACGTCATTAT
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIHIIIIIIIIIIIIEEHHIIEHIIIIIHHIIIDHGIIIIIHIHIHII

This is what I got when I tried to run it:

SalmonTE.py quant --reference=hs --outpath=/mnt/100GB/SalmonTE /mnt/400GB/RNA-seq/fastq
2020-03-23 16:46:49,269 Starting quantification mode
2020-03-23 16:46:49,269 Collecting FASTQ files...
2020-03-23 16:46:49,270 The input dataset is considered as a single-end dataset.
2020-03-23 16:46:49,270 Collected 1 FASTQ files.
2020-03-23 16:46:49,270 Quantification has been finished.
2020-03-23 16:46:49,270 Running Salmon using Snakemake
Job counts:
    count   jobs
    1   all
    1   collect_abundance
    1   collect_mappability
    1   run_salmon_fq
    4
2020-03-23 16:46:49,551 Job counts:
    count   jobs
    1   all
    1   collect_abundance
    1   collect_mappability
    1   run_salmon_fq
    4

I get a different output when I add "/" at the end:

SalmonTE.py quant --reference=hs --outpath=/mnt/100GB/SalmonTE /mnt/400GB/RNA-seq/fastq/
2020-03-23 16:00:24,276 Starting quantification mode
2020-03-23 16:00:24,276 Collecting FASTQ files...
2020-03-23 16:00:24,276 SalmonTE assumes that '/mnt/400GB/RNA-seq/fastq/' is a directory, and SalmonTE will search any FASTQ file in the directory.
2020-03-23 16:00:24,369 One of input files can be paired to multiple files

When I run the files individually it works: SalmonTE.py quant --reference=hs --outpath=/mnt/100GB/SalmonTE /mnt/400GB/RNA-seq/fastq/negcon1a_1.fq.gz

So in the end I ran the files one by one and manually modified the condition.csv, EXPR.csv, and MAPPING_INFO.csv output files to collate the results and proceed to the SalmonTE.py test step. This works fine; the only disadvantage is that I have separate results for the two files of each pair.
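A per-sample run of this kind could be scripted roughly as follows, passing both mates of a sample together as in the single-sample command reported earlier in the thread. This is only a sketch: the `_1`/`_2` and `_R1`/`_R2` file name convention, the helper names `group_mates` and `build_commands`, and the one-output-directory-per-sample layout are assumptions for illustration. Only the `quant`, `--reference`, and `--outpath` arguments are taken from the commands quoted in this thread. The code builds the command lines without running them.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_1.fq, <sample>_R2.fastq.gz, and so on.
MATE_RE = re.compile(r"^(?P<sample>.+)_R?(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def group_mates(filenames):
    """Map each sample name to a {"1": file, "2": file} dictionary."""
    samples = defaultdict(dict)
    for name in filenames:
        match = MATE_RE.match(name)
        if match:
            samples[match.group("sample")][match.group("mate")] = name
    return dict(samples)


def build_commands(filenames, reference, outdir):
    """Build one SalmonTE.py quant command per complete mate pair."""
    commands = []
    for sample, mates in sorted(group_mates(filenames).items()):
        if set(mates) != {"1", "2"}:
            continue  # skip samples with a missing mate
        commands.append([
            "SalmonTE.py", "quant",
            "--reference=" + reference,
            "--outpath=" + outdir + "/" + sample,  # per-sample directory (assumption)
            mates["1"], mates["2"],
        ])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`; the per-sample outputs would still have to be collated by hand as described above.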

Best regards, -Reinhard

hyunhwan-jeong commented 3 years ago

@brunmer4 sorry for the late response. I checked the files, and they should be detected as paired-end:

$ python3 SalmonTE.py quant --reference=hs example/counter_example
2020-07-23 06:50:44,504 Starting quantification mode
2020-07-23 06:50:44,504 Collecting FASTQ files...
2020-07-23 06:50:44,504 SalmonTE assumes that 'example/counter_example' is a directory, and SalmonTE will search any FASTQ file in the directory.
2020-07-23 06:50:44,508 The input dataset is considered as a paired-ends dataset.
2020-07-23 06:50:44,508 Collected 1 FASTQ files.
2020-07-23 06:50:44,508 Quantification has been finished.
2020-07-23 06:50:44,508 Running Salmon using Snakemake

Did you test with the latest version of SalmonTE?

Thank you,

Hyun-Hwan Jeong