shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data

Empty counts files #69

Open renberg opened 2 years ago

renberg commented 2 years ago

I got the pipeline to work with a previous set of sequencing data, but now it is failing with the error message below and leaving empty count files in the outputs folder. All I've changed is the data directory and the experiment.csv file, which I updated to account for the new data. Any idea where the breakdown might be occurring?

Thanks!

executor >  slurm (36)
[e3/8f8bd3] process > create_BAM (make idx)    [100%] 6 of 6 ✔
[f4/661ab1] process > raw_counts (6)           [100%] 6 of 6 ✔
[bd/e76751] process > filter_counts (6)        [100%] 6 of 6 ✔
[01/6d0d95] process > final_counts (6)         [100%] 6 of 6 ✔
[81/009046] process > dna_rna_merge_counts (3) [100%] 3 of 3 ✔
[50/f5d719] process > dna_rna_merge (3)        [100%] 3 of 3 ✔
[5d/ccf914] process > calc_correlations (2)    [  0%] 0 of 3
[61/5f2759] process > make_master_tables (2)   [ 66%] 2 of 3
Error executing process > 'calc_correlations (1)'

Caused by:
  Process `calc_correlations (1)` terminated with an error exit status (1)

Command executed:

  Rscript /home/~~~~~/MPRAflow/src/plot_perInsertCounts_correlation.R 36CRS_1 NA 10 36CRS_1_1_counts.tsv 1

Command exit status:
  1

Command output:
                    File Replicate Condition
  1 36CRS_1_1_counts.tsv         1   36CRS_1
  [1] "hist"
  [1] name      dna_count rna_count ratio     log2      n_obs_bc 
  <0 rows> (or 0-length row.names)
  logical(0)

Command error:

  Attaching package: ‘dplyr’

  The following objects are masked from ‘package:stats’:

      filter, lag

  The following objects are masked from ‘package:base’:

      intersect, setdiff, setequal, union

  Error in hist.default(as.numeric(data1$n_obs_bc), breaks = 100, xlim = c(0,  : 
    character(0)
  Calls: hist -> hist.default
  In addition: Warning messages:
  1: In min(x) : no non-missing arguments to min; returning Inf
  2: In max(x) : no non-missing arguments to max; returning -Inf
  Execution halted

Work dir:
  /home/~~~~~/MPRAflow/work/45/d569c5d4c05016cf996063983a9986

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
visze commented 2 years ago

I suspect that the count file (36CRS_1_1_counts.tsv) is empty. You can find it here: /home/~~~~~/MPRAflow/work/45/d569c5d4c05016cf996063983a9986

So the error is happening upstream, and you should check all output files of the previous processes to see where the pipeline first creates empty counts. My guess is that the assignment did not work (process dna_rna_merge), e.g. the assignment pickle file contains no associations or the wrong associations (such as names that are not in the design file).
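A rough way to do that check from the command line could be something like the following (the file patterns and the pickle name are assumptions; adapt them to your run):

  # list count files in the Nextflow work directories that are empty or nearly empty
  find work -name "*counts*" -size -100c -exec ls -l {} \;

  # check how many barcode-to-oligo associations the assignment pickle holds
  # (the pickle filename is a placeholder; substitute the one from your experiment config)
  python -c "import pickle, sys; print(len(pickle.load(open(sys.argv[1], 'rb'))))" assignment.pickle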

visze commented 2 years ago

Any news here? Could you solve it?

renberg commented 2 years ago

Hi, yes and no.

I figured out that my barcode reads had extra bases, which I think was preventing them from being matched to the barcodes properly. So I trimmed those bases and ran it again. Now it fails at this point:

executor >  slurm (36)
[93/066772] process > create_BAM (make idx)    [100%] 6 of 6 ✔
[a0/22f4d8] process > raw_counts (6)           [100%] 6 of 6 ✔
[e5/f97b70] process > filter_counts (6)        [100%] 6 of 6 ✔
[1a/495a96] process > final_counts (6)         [100%] 6 of 6 ✔
[81/a4c2a0] process > dna_rna_merge_counts (2) [100%] 3 of 3 ✔
[20/7485c2] process > dna_rna_merge (1)        [100%] 3 of 3 ✔
[82/55e4be] process > calc_correlations (3)    [100%] 1 of 1, failed: 1
[9e/f430f4] process > make_master_tables (3)   [200%] 2 of 1 ✔
Error executing process > 'calc_correlations (1)'

Caused by:
  Missing output file(s) `*_correlation.txt` expected by process `calc_correlations (1)`

Command executed:

  Rscript /home/~/MPRAflow/src/plot_perInsertCounts_correlation.R 36CRS_2 NA 10 36CRS_2_1_counts.tsv 1

Command error:

  Attaching package: ‘dplyr’

  The following objects are masked from ‘package:stats’:

      filter, lag

  The following objects are masked from ‘package:base’:

      intersect, setdiff, setequal, union

Work dir:
  /home/~/MPRAflow/work/02/01a83548c52f2f6f08770bd1df7500

In a previous post where you helped me, I got this error with a different data set and you said it could be ignored because I didn't have replicates to calculate correlations with. That is the case with this dataset as well. However, unlike that previous data, the outputs generated up to this point don't make any sense. There are large groups of barcodes with identical count results; in one set of data every RNA count is exactly 25,000 except one that is 50,000 and one that is 100,000. Looking at the raw DNA counts showed small whole numbers that were grouped (i.e. a handful of barcodes with 4, a handful with 16, a handful with 2) instead of the more continuous distribution we expected.

It is unclear if something is still going wrong in the processing, or if something is weird with our sequencing data itself. Any insights? We are really excited to use this software for some larger scale projects, but so far we haven't gotten it to work. Also happy to correspond by email if you're willing to take the conversation off of your issues page. Thanks in advance for your continued assistance.
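P.S. For reference, the trimming step I mentioned above was along these lines (the tool, file names, and number of bases here are illustrative placeholders rather than exactly what I ran):

  # drop the extra bases from the 5' end of each barcode read before rerunning the pipeline
  cutadapt -u 4 -o barcodes.trimmed.fastq.gz barcodes.fastq.gz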

visze commented 2 years ago

OK, the first part is because of the missing replicates, as you already figured out yourself. But the missing replicates will make the second part very difficult to diagnose.

This is really strange behaviour. Maybe you have PCR artefacts. Did you use UMIs? If so, you should check the duplication rate.
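A quick way to eyeball that, assuming the barcodes and UMIs are in separate FASTQ files (the file names below are placeholders), is to compare the total number of reads with the number of unique barcode+UMI combinations; a large gap points to PCR duplication:

  # extract the sequence line of every read and pair each barcode with its UMI
  paste <(zcat barcode_reads.fastq.gz | awk 'NR%4==2') \
        <(zcat umi_reads.fastq.gz | awk 'NR%4==2') | sort > bc_umi_pairs.txt
  wc -l bc_umi_pairs.txt            # total reads
  uniq bc_umi_pairs.txt | wc -l     # unique barcode+UMI combinations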

I am just guessing, but I think these large, very even numbers like 50,000 or 10,000 are the result of some counting artefact. I'm not sure what exactly causes it, but it is possible, for example, that the command uniq -c has an upper limit (e.g. 1000) and would then only count a barcode 1000 times; after summing up all barcodes for one oligo you would get (BCs per oligo) × (maximum count).

I really do not know whether this is an issue with the assignment (e.g. if most barcodes are assigned to a few specific oligos), with your count data (e.g. PCR artefacts), or with your experimental design in general (having elements that show enormous activity compared to the others). I think this is a point where a standardized workflow cannot help anymore and you have to go deep into the data, check step by step what happens, and see whether the data reflects your expectations.
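As a concrete first step, you could look at how the count values are distributed in one of the merged count files; if almost everything collapses onto a few round numbers, that points to a counting or assignment artefact rather than real signal. A sketch, assuming the column layout printed by the R script above (tab-separated, dna_count in column 2 and rna_count in column 3):

  # tally how often each DNA count value occurs
  cut -f2 36CRS_1_1_counts.tsv | sort -n | uniq -c | sort -rn | head -20
  # and the same for the RNA counts
  cut -f3 36CRS_1_1_counts.tsv | sort -n | uniq -c | sort -rn | head -20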