Closed lotard closed 3 years ago
Hey, thanks for letting us know. This is strange, because the basic count example workflow should work. I will rerun it on our side.
You can define a threshold using the --thresh option; it should not be hard-coded in a script. Thanks for finding that. I will update this soon. I am still working on a new workflow, so I can integrate this fix into the release (together with the debugging of the basic count example).
Solved: experiment.csv needs a slight modification (the UMIs are in the _2 files, not the _3 files):
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
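Before editing experiment.csv like this, it can be worth double-checking which mate file actually holds the UMI reads by peeking at read lengths. A minimal sketch, assuming the UMI reads are noticeably shorter than the barcode reads; `first_read_len` is a hypothetical helper, not part of the pipeline:

```shell
# Hypothetical helper: print the length of the first read in a gzipped
# FASTQ, to spot which mate file holds the short UMI reads.
first_read_len() {
  # $1: path to a .fastq.gz file
  zcat "$1" | awk 'NR==2 { print length($0); exit }'
}

# Example usage (filenames from the table above):
# for f in SRR10800881_*.fastq.gz; do
#   echo "$f: $(first_read_len "$f")"
# done
```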
While I'm at it, a few small gripes:
- fastq-dump --gzip --split-files
  afterwards will just download the fastqs, not unpack the .sra (there's some other command for that, I think).
- outs/assoc_basic, rather than Assoc_Basic/output, would be consistent with the Count workflow (and cleaner):
  nextflow run association.nf -w Assoc_Basic/work --fastq-insert Assoc_Basic/data/SRR10800986_1.fastq.gz --fastq-insertPE Assoc_Basic/data/SRR10800986_3.fastq.gz --fastq-bc Assoc_Basic/data/SRR10800986_2.fastq.gz --design Assoc_Basic/data/design.fa --outdir Assoc_Basic/output

In any case, great software, thank you!
Version v2.3 now uses the --thresh option in plot_perInsertCounts_correlation.R, and it is no longer hard-coded.
I tried to replicate the Basic Count workflow example, and it fails at the calc_correlations stage. It seems the number of UMIs per BC is so low that the threshold filtering in calc_correlations (incidentally, this threshold is hard-coded at 10 in plot_perInsertCounts_correlation.R) removed everything.

Of note, the number of BCs per replicate is also pretty low (the output/1-3/SRR..DNA... files have ~400 lines each). The original fastq files seem fine, with >20M reads each (grep -c '^@'). So either the reads aren't mapping well, or some other filtering step removes many of them. Any ideas where to look? (filter_counts?)
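One caveat on the read counting above: grep -c '^@' can overcount, because quality lines may also start with '@'. Counting total lines and dividing by four is safer. A small sketch; `read_count` is a hypothetical helper:

```shell
# Count reads in a gzipped FASTQ as total lines / 4; more robust than
# grep -c '^@', since quality strings can also begin with '@'.
read_count() {
  # $1: path to a .fastq.gz file
  zcat "$1" | awk 'END { print NR / 4 }'
}
```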
I've also noticed that "DNA_freqUMIs" fails to be written at some point, but it doesn't fail any scripts, and it is downstream of writing the output count tables.
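A quick way to confirm whether that file was actually written and is non-empty after a run (the check below is a generic sketch; the path you pass in would be wherever your run writes the freqUMIs output):

```shell
# Report whether an expected pipeline output file exists and is non-empty.
check_output() {
  # $1: path to an expected output file (placeholder in the usage below)
  if [ -s "$1" ]; then
    echo "OK: $1"
  else
    echo "MISSING or EMPTY: $1"
  fi
}

# Example usage:
# check_output output/1/some_sample_DNA_freqUMIs.txt
```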
I'm using Slurm as the scheduler for the bulk of the run. My home directory is different from the scratch space where I store packages and environments and run the analyses.