hoelzer-lab / rnaflow

A simple RNA-Seq differential gene expression pipeline using Nextflow
GNU General Public License v3.0
95 stars 20 forks

Process `expression_reference_based:tpm_filter` input file name collision #190

Closed AGunnell77 closed 2 years ago

AGunnell77 commented 2 years ago

Hi, I have come across this issue while running the pipeline:

Error executing process > 'expression_reference_based:tpm_filter'

Caused by: Process expression_reference_based:tpm_filter input file name collision -- There are multiple input files for each of the following file names: null.counts.tsv

fischer-hub commented 2 years ago

Hi @AGunnell77, thanks for reporting this issue!

Can you post the full command you used when running into the error and attach the full log (.nextflow.log in the project directory) file please?

Thanks in advance!

AGunnell77 commented 2 years ago

Hi, is this OK? I'm not sure where to find the project directory.

nextflow run hoelzer-lab/rnaflow --reads ~/path/RNAflow_sample_sheet.csv --autodownload hsa --pathway hsa --strand 2 -with-tower --condaCacheDir ~/path/condacache/ --skip_sortmerna -resume

.nextflow.log.4.txt

fischer-hub commented 2 years ago

Thank you @AGunnell77, that is already very helpful!

According to the log file your sample names resolve to null instead of the real sample name. Would you mind attaching your RNAflow_sample_sheet.csv, or maybe just the first few lines? Maybe the issue is already there and the pipeline doesn't read the sample names in correctly :)

Cheers!

AGunnell77 commented 2 years ago

Hi, this is my sample sheet.

sample,R1,R2,Condition,Source,strandedness
PARENTAL_rep1,/path/K010001_Parental_1_S37_R1_001.fastq.gz,/path/K010001_Parental_1_S37_R2_001.fastq.gz,Parental,,2
PARENTAL_rep2,/path/K010002_Parental_2_S38_R1_001.fastq.gz,/path/K010002_Parental_2_S38_R2_001.fastq.gz,Parental,,2
PARENTAL_rep3,/path/K010003_Parental_3_S39_R1_001.fastq.gz,/path/K010003_Parental_3_S39_R2_001.fastq.gz,Parental,,2
KO_rep1,/path/K010004_KnO_1_S40_R1_001.fastq.gz,path/K010004_KnO_1_S40_R2_001.fastq.gz,TKO,,2
KO_rep2,/path/K010005_KnO_2_S41_R1_001.fastq.gz,/path/K010005_KnO_2_S41_R2_001.fastq.gz,TKO,,2
KO_rep3,/path/K010006/K010006_KnO_3_S42_R1_001.fastq.gz,path/K010006_KnO_3_S42_R2_001.fastq.gz,TKO,,2

Thanks

fischer-hub commented 2 years ago

Many thanks @AGunnell77! It seems the header of your input.csv file is slightly incorrect; your header is:

sample,R1,R2,Condition,Source,strandedness

but the pipeline expects a header of this kind:

Sample,R1,R2,Condition,Source,Strandedness

The column names in the input.csv are case sensitive, so sample will not be recognized but Sample will. Maybe we should change this to accept both lowercase and uppercase column names :)

Just correct the lowercase column names in your input.csv header and you should be good to go! Or just copy the correct header I posted above. Then you should also see that the processes show which sample they are currently processing, e.g. process > preprocess_illumina:fastqcPre (PARENTAL_rep1) instead of process > preprocess_illumina:fastqcPre (null).
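The case-insensitive matching suggested above could look roughly like this (a Python sketch for illustration only; RNAflow itself is written in Nextflow/Groovy, and none of these function names come from the pipeline):

```python
import csv
import io

def read_samples(csv_text):
    """Parse a sample sheet, matching the expected column names case-insensitively."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Map lowercase column name -> the casing the user actually wrote in the header.
    lookup = {name.lower(): name for name in reader.fieldnames}
    expected = ["Sample", "R1", "R2", "Condition", "Source", "Strandedness"]
    return [{col: row[lookup[col.lower()]] for col in expected} for row in reader]

# A header written all-lowercase, as in the sample sheet above, still resolves.
sheet = (
    "sample,R1,R2,Condition,Source,strandedness\n"
    "PARENTAL_rep1,/path/a_R1.fastq.gz,/path/a_R2.fastq.gz,Parental,,2\n"
)
print(read_samples(sheet)[0]["Sample"])  # PARENTAL_rep1
```

With a lookup like this, both `sample` and `Sample` resolve to the same column, so the sample name no longer comes out as null.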

Let me know if this fixed the issue!

AGunnell77 commented 2 years ago

Many Thanks! I'll give it a go. Andrea

AGunnell77 commented 2 years ago

Hi, the change to the sample sheet got me a bit further, but the process then stopped with the attached error report (error_report_nauseous_wilson). I tried resuming, but the deseq2 stage is still not proceeding (nextflowlogserene_galileo). nextflowlogserene_galileo.txt error_report_nauseous_wilson.txt Thanks, Andrea

fischer-hub commented 2 years ago

Hi @AGunnell77 !

It seems the deseq2 script is crashing for some reason. To better understand what's going on, could you please attach the deseq2 log file? You can find it in the working directory of the process: /data/scratch/DGE/DUDGE/MOPOPGEN/agunnell/RNAflowrevcorrect/work/d6/6b49f615bbb881b603cce213e059e8/deseq2.Rout

AGunnell77 commented 2 years ago

Here is the file. It looks like there are no counts because the data is not being recognised as paired-end, so all the paired-end alignments are excluded? deseq2.Rout.txt

Status                        PARENTAL_rep2.sorted.bam
Assigned                      0
Unassigned_Unmapped           3356569
Unassigned_Read_Type          255871819
Unassigned_Singleton          0
Unassigned_MappingQuality     0
Unassigned_Chimera            0
Unassigned_FragmentLength     0
Unassigned_Duplicate          0
Unassigned_MultiMapping       0
Unassigned_Secondary          0
Unassigned_NonSplit           0
Unassigned_NoFeatures         0
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity          0

Process BAM file PARENTAL_rep2.sorted.bam...
Strand specific : reversely stranded
WARNING: Paired-end reads were found and excluded.
Total alignments : 259228388
Successfully assigned alignments : 0 (0.0%)
Running time : 3.99 minutes

fischer-hub commented 2 years ago

@AGunnell77 that looks odd indeed. However, the first log file (.nextflow.log.4) suggested that the pipeline detected the read mode correctly as paired-end. Can you check if the .bam files really contain any alignments? I'm not quite sure where the log file you posted above is from?

AGunnell77 commented 2 years ago

Hi, they were from here:
/work/08/523f900061b92f7d277e0692829ce0/.command.log
/work/08/523f900061b92f7d277e0692829ce0/PARENTAL_rep2.counts.tsv.summary
I had this previously, prior to running into the input file name collision. When I added a -p to the .command.sh here (see below) and ran the command within this work directory, it then showed all the reads aligned in the featureCounts summary... but then I had the file name collision and started from scratch with the updated input csv.
/data/scratch/DGE/DUDGE/MOPOPGEN/agunnell/RNAflowrevcorrect/work/08/523f900061b92f7d277e0692829ce0/.command.sh

i.e. featureCounts -p -T 1 -s 2 -a annotation.gtf -o PARENTAL_rep2.counts.tsv -t exon -g gene_id PARENTAL_rep2.sorted.bam
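A featureCounts .counts.tsv.summary file like the one pasted earlier can be sanity-checked with a few lines of Python (an illustrative sketch, not part of RNAflow; the summary format is tab-separated with a header row, then one row per assignment status):

```python
def assigned_fraction(summary_text):
    """Compute assigned / total alignments from a featureCounts .summary file."""
    counts = {}
    for line in summary_text.strip().splitlines()[1:]:  # skip the header row
        status, value = line.split("\t")
        counts[status] = int(value)
    total = sum(counts.values())
    return counts.get("Assigned", 0) / total if total else 0.0

# Numbers taken from the summary posted above (everything unassigned by read type).
example = (
    "Status\tPARENTAL_rep2.sorted.bam\n"
    "Assigned\t0\n"
    "Unassigned_Unmapped\t3356569\n"
    "Unassigned_Read_Type\t255871819\n"
)
print(assigned_fraction(example))  # 0.0
```

An assigned fraction of 0.0, with almost everything in Unassigned_Read_Type, is exactly the signature of paired-end reads counted without the -p flag.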

fischer-hub commented 2 years ago

@AGunnell77 So after fixing the input.csv you ran the pipeline from scratch, am I understanding correctly? Or did you run only part of the pipeline, e.g. because you restarted it with -resume? I'm currently testing on some paired-end data, but this issue has never occurred to me before. If the pipeline detects paired-end reads from the input.csv, featureCounts should usually be run with the -p parameter.. If you didn't run the pipeline from scratch (so without -resume) with the corrected input.csv, could you please do so and tell me if the featureCounts summary still reports the same issue?

Cheers

AGunnell77 commented 2 years ago

Hi, I started from scratch after changing the input file. All the work and results files in this directory were deleted. I resumed once at the point of the deseq2 error in case it was a glitch, but that was after the input file was corrected.

AGunnell77 commented 2 years ago

Hi, so I manually added the -p and it has all aligned OK, but I now have this error: deseq2.R.out_2.txt. Any ideas? It has actually carried out DESeq2 and I can see the volcano plot, MA plot, heatmap and Excel results in /work/97/6b0a537e47d5c7e8ae0b3a863d3e7c/, but the pathway analysis has not been carried out and the DESeq2 results are not in the results folder or the final MultiQC.

I feel I am close! Thanks so much for all your help so far! Andrea

fischer-hub commented 2 years ago

Hi @AGunnell77, great news that deseq2 is running now!

From the log file you attached it seems that your disk quota is full

  cannot create dir '/home/agunnell/.cache/biomaRt', reason 'Disk quota exceeded'

and that is why the script crashes. Maybe you can delete some files to free up some space in your home directory and try again?
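As an aside, a quick way to see how much space is free on the filesystem holding your home directory is a two-line Python check (a rough sketch only: this reports filesystem free space, not your per-user quota, which on a shared cluster is often the tighter limit; recent biomaRt versions also let you relocate the cache, e.g. via the BIOMART_CACHE environment variable, if I recall the docs correctly):

```python
import shutil
from pathlib import Path

# Free space on the filesystem that contains the home directory.
# Note: this is filesystem capacity, not a per-user disk quota,
# so a quota can still be exceeded while this reports plenty of space.
home = Path.home()
usage = shutil.disk_usage(home)
print(f"{usage.free / 1e9:.1f} GB free out of {usage.total / 1e9:.1f} GB on {home}")
```

On a cluster, checking the actual quota usually means a site-specific command (e.g. quota or lfs quota, depending on the filesystem).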

hoelzer commented 2 years ago

... or don't start the pipeline in your home directory! Or point the work directory (-w) and the results folder (--output) to another path with enough disk space.

(thx @fischer-hub for the troubleshooting!)

AGunnell77 commented 2 years ago

Hmmm, that's strange. I'm not running from my home directory and I have my cache set elsewhere too. I also added the -w and --output elsewhere on the wrapper, but for this stage it still seemed to want to use the home directory, as it ran once I cleared space there. Anyway, it's all completed successfully now. Thank you so much for all your help! Andrea

fischer-hub commented 2 years ago

Great that it ran through now! Yes, the biomaRt R package has a cache directory that apparently is set to /home by default, so this issue was independent of the output and work directories!

Best wishes!