example: calc_correlations fails due to low UMI/BC numbers

lotard commented 3 years ago

I tried to replicate Basic Count workflow example and it fails at calc_correlations stage. It seems like the number of UMIs/BC is so low that threshold filtering (incidentally, this one is hardcoded in plot_perInsertCounts_correlation.R at 10) in calc_correlations removed everything.

Of note, the numbers of BCs per replicate is also pretty low (output/1-3/SRR..DNA... files have ~400 lines each). Original fastq files seem fine with >20M reads each (grep -c '^@'). So either they aren't mapping well or some other filtering process removes many of them. Any ideas where to look? (filter_counts?)

I've also noticed that "DNA_freqUMIs" fails to be written at some point, but it doesn't fail any scripts and it's downstreams of writting output count tables.

I'm using slurm as scheduler for the bulk of the run. My home directory is different than the scratch on which I'm storing packages and environments and running analyses.

MPRAflow v2.2"
=======================================================
Pipeline Name  : shendurelab/MPRAflow
Pipeline Version: 2.2
Run Name       : hopeful_kalman
Output dir     : Count_Basic/output
Working dir    : <...>/MPRAflow/Count_Basic/work
Current home   : <home>
Current user   : <user>
Current path   : <...>/MPRAflow
Script dir     : <...>/MPRAflow
Config Profile : standard
Experiment File: <...>/MPRAflow/Count_Basic/data/experiment.csv
reads          : DataflowQueue(queue=[])
UMIs           : Reads with UMI
BC length      : 15
BC threshold   : 10
mprAnalyze     : false
=========================================
start analysis

executor >  local (2)
[14/13d43b] process > create_BAM           [100%] 6 of 6, cached: 6 ✔
[60/2feb24] process > raw_counts           [100%] 6 of 6, cached: 6 ✔
[fa/8073af] process > filter_counts        [100%] 6 of 6, cached: 6 ✔
[37/03f11f] process > final_counts         [100%] 6 of 6, cached: 6 ✔
[1d/f2162a] process > dna_rna_merge_counts [100%] 3 of 3, cached: 3 ✔
[8a/b655c6] process > dna_rna_merge        [100%] 3 of 3, cached: 3 ✔
[71/24b610] process > calc_correlations    [100%] 1 of 1, failed: 1 ✘
[12/cb31f1] process > make_master_tables   [100%] 1 of 1 ✔
Error executing process > 'calc_correlations (1)'

Caused by:
  Process `calc_correlations (1)` terminated with an error exit status (1)

Command executed:

  Rscript <...>/MPRAflow/src/plot_perInsertCounts_correlation.R HEPG2 NA HEPG2_2_counts.tsv HEPG2_3_counts.tsv HEPG2_1_counts.tsv 2 3 1

Command exit status:
  1

Command output:
                  File Replicate Condition
  1 HEPG2_2_counts.tsv         2     HEPG2
  2 HEPG2_3_counts.tsv         3     HEPG2
  3 HEPG2_1_counts.tsv         1     HEPG2
  [1] "sel"
       [,1] [,2] [,3]
  [1,] 2    2    3   
  [2,] 3    1    1   
  Levels: 1 2 3
  [1] "reps"
  [1] "selected replicates: "
  [1] 2 3
  Levels: 1 2 3
  [1] "tables loaded"
  [1] "nrow data1: 616"
  [1] "nrow data2: 784"
                                                                                                 name
  1                               R:FOXA1-ChMod_chr11:47026382-47026514__chr11:47026362-47026533_:072
  2                           R:EP300-NoMod_chr14:101143986-101144157__chr14:101143986-101144157_:078
  3                               R:HNF4A-NoMod_chr20:35089751-35089903__chr20:35089741-35089912_:003
  4                               R:HNF4A-ChMod_chr17:66774221-66774349__chr17:66774199-66774370_:095
  5 C:SLEA_hg18:chr9:82902419-82902586|47:V_CEBP_Q2_01:ATTGCACAATCC;109:V_CEBP_Q2_01:ATTGCACAATCC:029
  6                                       A:HNF4A-ChMod_chr16:559346-559465__chr16:559320-559491_:071
    dna_count rna_count     ratio        log2 n_obs_bc
  1  1070.664 1095.2903 1.0230011  0.03280769        1
  2  1070.664 1095.2903 1.0230011  0.03280769        1
  3  1070.664 1095.2903 1.0230011  0.03280769        1
  4  1427.552  730.1935 0.5115005 -0.96719231        2
  5  2141.328  547.6451 0.2557503 -1.96719231        1
  6  1070.664  821.4677 0.7672508 -0.38222981        3
                                                                       name
  1       R:FOXA2-NoMod_chr1:43501797-43501920__chr1:43501773-43501944_:067
  2   R:FOXA1-NoMod_chr6:155809024-155809123__chr6:155808988-155809159_:011
  3       R:FOXA2-ChMod_chr7:47539852-47540021__chr7:47539851-47540022_:074
  4 R:HNF4A-ChMod_chr12:100831288-100831452__chr12:100831284-100831455_:043
  5     R:FOXA1-ChMod_chr13:22412044-22412188__chr13:22412030-22412201_:085
  6     R:EP300-ChMod_chr21:36207977-36208148__chr21:36207977-36208148_:062
    dna_count rna_count     ratio        log2 n_obs_bc
  1  626.9592  596.1252 0.9508197 -0.07275634        4
  2  522.4660  745.1565 1.4262295  0.51220616        2
  3 1567.3981  372.5782 0.2377049 -2.07275634        1
  4 1044.9321  496.7710 0.4754098 -1.07275634        2
  5 1044.9321  496.7710 0.4754098 -1.07275634        2
  6  522.4660  745.1565 1.4262295  0.51220616        2
  [1] "filtered"
  [1] "nrow filtered data1: 0"
  [1] "nrow filtered data2: 0"
   [1] name        dna_count.x rna_count.x ratio.x     log2.x      n_obs_bc.x 
   [7] dna_count.y rna_count.y ratio.y     log2.y      n_obs_bc.y 
  <0 rows> (or 0-length row.names)

Command error:

  Attaching package: 'dplyr'

  The following objects are masked from 'package:stats':

      filter, lag

  The following objects are masked from 'package:base':

      intersect, setdiff, setequal, union

  Error in `$<-.data.frame`(`*tmp*`, "label", value = "NA") : 
    replacement has 1 row, data has 0
  Calls: $<- -> $<-.data.frame
  Execution halted

Work dir:
  <...>/MPRAflow/Count_Basic/work/71/24b6107c4e71277747d2a529d15eaa

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

visze commented 3 years ago

Hey. Thanks for letting us know. This is strange because the basic count work example workflow should work. But I will rerun this again on our side.

You can define a threshold by using the --thresh option. But it should not be hard-coded in a script. Thanks for finding that. I will update this soon. I am still working on a new workflow so I can integrate this fix in the release (with the debug of the basic count example).

lotard commented 3 years ago

Solved: experiment.csv needs a slight modification (UMIs are in _2, not _3 files):

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz

While I'm at it, some small jibes:

both Assoc and Count example suggest prefetch which gets .sra file, but running fastq-dump --gzip --split-files after will just download fastqs, not unpack the .sra (there's some other command for that I think)
it's stated that workflow will only work on Linux, but it does a good job of installing on Mac, untill actual run, where it finds out some dependencies are linux-only (yeah, my bad...) - if possible, a warning message on installation would help save time for hopeless coders like me
outputs for Assoc workflow are deposited in outs/assoc_basic, rather than Assoc_Basic/output, which would be consistent with Count workflow (and cleaner)
could be nice if run commands suggested full paths, but it's a cherry on top (ie nextflow run association.nf -w Assoc_Basic/work --fastq-insert Assoc_Basic/data/SRR10800986_1.fastq.gz --fastq-insertPE Assoc_Basic/data/SRR10800986_3.fastq.gz --fastq-bc Assoc_Basic/data/SRR10800986_2.fastq.gz --design Assoc_Basic/data/design.fa --outdir Assoc_Basic/output)
could be nice to include insert labels, now it's all NAs
could be nice to mention each workflow example (Assoc and Count) require some 50GB space for raw files and temps, not to mention conda files

In any case, great software, thank you!

visze commented 3 years ago

version v2.3 uses now the --thresh in plot_perInsertCounts_correlation.R and it is not longer hard-coded.

shendurelab / MPRAflow

example: calc_correlations fails due to low UMI/BC numbers #37