shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Different "Process `create_BAM` input file name collision" issue #66

Closed renberg closed 2 years ago

renberg commented 2 years ago

I saw someone else producing the same error, but mine appears to be different so I'm hoping someone can help. I've been struggling to get this pipeline to work for a few weeks now. Here is my output from the latest failed run:

> N E X T F L O W  ~  version 22.04.3
> Launching `count.nf` [ridiculous_murdock] DSL1 - revision: 720e5db844
> WARN: Access to undefined parameter `version` -- Initialise it to a default value eg. `params.version = some_value`
> =======================================================
>                                           ,--./,-.
>           ___     __   __   __   ___     /,-._.--~'
>     |\ | |__  __ /  ` /  \ |__) |__         }  {
>     | \| |       \__, \__/ |  \ |___     \`-._,-`-,
>                                           `._,._,'
> MPRAflow vnull"
> =======================================================
> Pipeline Name  : shendurelab/MPRAflow
> Pipeline Version: null
> Run Name       : ridiculous_murdock
> Output dir     : 36crs/outputs
> Working dir    : /home/renberg/MPRAflow/work
> Current home   : /home/renberg
> Current user   : renberg
> Current path   : /home/renberg/MPRAflow
> Script dir     : /home/renberg/MPRAflow
> Config Profile : standard
> Experiment File: /home/renberg/MPRAflow/36crs/36crs_experiment.csv
> reads          : DataflowQueue(queue=[])
> UMIs           : Reads with UMI
> BC length      : 15
> BC threshold   : 10
> mprAnalyze     : false
> =========================================
> WARN: Access to undefined parameter `nf_required_version` -- Initialise it to a default value eg. `params.nf_required_version = some_value`
> ====================================================
>   Nextflow version null required! You are running v22.04.3.
>   Pipeline execution will continue, but things may break.
>   Please run `nextflow self-update` to update Nextflow.
> ============================================================
> 
> 
> start analysis
> [-        ] process > create_BAM           -
> [-        ] process > raw_counts           -
> [-        ] process > filter_counts        -
> [-        ] process > final_counts         -
> [-        ] process > dna_rna_merge_counts -
> [-        ] process > dna_rna_merge        -
> [-        ] process > calc_correlations    -
> [-        ] process > make_master_tables   -
> 
> Error executing process > 'create_BAM (make idx)'
> 
> Caused by:
>   Process `create_BAM` input file name collision -- There are multiple input files for each of the following file names: null
> 
> 
> Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Here is the command I am running (as part of a Slurm script): `nextflow run count.nf --experiment-file "36crs/36crs_experiment.csv" --dir "36crs/6146-AR/fastqs_6146-AR" --outdir "36crs/outputs" --association "36crs/fake_assoc_dict.p" --design "36crs/data/CRS.fa"`

For additional possibly pertinent background, I made the association and design files from scratch instead of using the association function because we used known unique barcodes for our 36 inserts and did not need to do an initial sequencing run for barcode association.

Thanks for your help.

visze commented 2 years ago

Hi

Well, there seem to be multiple issues. First of all, it tells you that you are running pipeline version null. This should be something like 2.3.2; it is defined in the global config file. So which version of MPRAflow are you running?

Then it seems there is an issue with your experiment file. Nextflow is not able to create runs and replicates from it. Can you post this file?

Using your own association file is absolutely fine and not the error.

renberg commented 2 years ago

I noticed the null version thing too and don't know what to make of it. Where can I find the global config file? Sorry, I don't have much of a computer science background, and I really appreciate the help.

Here's my experiment file; we only have a single replicate for this small-scale study. 36crs_experiment.csv

visze commented 2 years ago

Thank you for providing the experiment file. It tells me that your header is wrong. Have a look at the documentation. The column names should be exactly: Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R. The order of the columns is not important (see code here), but the names are! I think your file order is R1,R2,I1, so you can use the header Condition,Replicate,DNA_BC_F,DNA_BC_R,DNA_UMI,RNA_BC_F,RNA_BC_R,RNA_UMI
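To catch this kind of header problem before launching the pipeline, a small check can compare the first line of the experiment file against the column names listed above. This is only a sketch in Python, not part of MPRAflow: the function name `check_header` and the example FASTQ file names are made up for illustration, and it assumes all eight columns are expected (a run without UMIs may need fewer).

```python
import csv
import io

# Column names quoted in this thread; the order does not matter, the names do.
REQUIRED_COLUMNS = {
    "Condition", "Replicate",
    "DNA_BC_F", "DNA_UMI", "DNA_BC_R",
    "RNA_BC_F", "RNA_UMI", "RNA_BC_R",
}


def check_header(csv_text):
    """Return (missing, unexpected) column names for an experiment CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    # An empty file yields an empty header, so every column is reported missing.
    header = [name.strip() for name in next(reader, [])]
    missing = sorted(REQUIRED_COLUMNS - set(header))
    unexpected = sorted(set(header) - REQUIRED_COLUMNS)
    return missing, unexpected


# Hypothetical single-replicate file; the FASTQ names are placeholders.
example = (
    "Condition,Replicate,DNA_BC_F,DNA_BC_R,DNA_UMI,RNA_BC_F,RNA_BC_R,RNA_UMI\n"
    "36CRS,1,dna_R1.fastq.gz,dna_R2.fastq.gz,dna_I1.fastq.gz,"
    "rna_R1.fastq.gz,rna_R2.fastq.gz,rna_I1.fastq.gz\n"
)

missing, unexpected = check_header(example)
print("missing:", missing, "unexpected:", unexpected)
```

If either list is non-empty, the header needs fixing before running `count.nf`. The comparison is case-sensitive on purpose, since the names have to match exactly.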

Again, the question: which version are you using? What did you download or check out? The conf/global.conf file should show the version.

renberg commented 2 years ago

Thanks for pointing out my experiment file mistake. I was using the header from your Nature communications paper example.

I found the global config file. It says MPRAflow version 2.3.5 and required Nextflow version 20.10.

renberg commented 2 years ago

Experiment file seems to have fixed that problem, it runs now!

However, it stopped toward the end and gave me the following:

WARN: Killing running tasks (1)
executor >  slurm (12)
[1a/ec4274] process > create_BAM (make idx)    [100%] 2 of 2 ✔
[b4/27ea54] process > raw_counts (1)           [100%] 2 of 2 ✔
[4c/aa988c] process > filter_counts (2)        [100%] 2 of 2 ✔
[9c/e6dc52] process > final_counts (2)         [100%] 2 of 2 ✔
[da/112811] process > dna_rna_merge_counts (1) [100%] 1 of 1 ✔
[38/b0feb5] process > dna_rna_merge (1)        [100%] 1 of 1 ✔
[9e/c4adc6] process > calc_correlations (1)    [  0%] 0 of 1
[ff/c340fa] process > make_master_tables (1)   [  0%] 0 of 1
Error executing process > 'calc_correlations (1)'

Caused by:
  Missing output file(s) `*_correlation.txt` expected by process `calc_correlations (1)`

Command executed:

  Rscript /home/renberg/MPRAflow/src/plot_perInsertCounts_correlation.R 36CRS NA 10 36CRS_1_counts.tsv 1

Command exit status:
  0

Command output:
                  File Replicate Condition
  1 36CRS_1_counts.tsv         1     36CRS
  [1] "hist"
                      name dna_count  rna_count     ratio        log2 n_obs_bc
  1           MiniPromoter  7961.529   4659.243 0.5852197 -0.77294985        1
  2                  SCP_1 17874.379  17406.864 0.9738444 -0.03823685        1
  3           CMV_Promoter 13447.450   5440.661 0.4045868 -1.30547875        1
  4    TNNT2_Promoter_Full 23967.918 108108.753 4.5105609  2.17330684        1
  5 TNNT2_Promoter_Minimal 11586.403  14480.612 1.2497935  0.32168979        1
  6  TNNT2_Promoter_Micro1 21363.842  17141.334 0.8023526 -0.31769173        1
  [1] 1 1 1 1 1 1
  [1] "boxplot"
                      name                log2
  1                                           
  2           MiniPromoter  -0.772949852586751
  3                  SCP_1 -0.0382368450302552
  4           CMV_Promoter   -1.30547874517947
  5    TNNT2_Promoter_Full    2.17330684394993
  6 TNNT2_Promoter_Minimal   0.321689790304395
  [1] "merged"
                      name                log2 label
  1                                               NA
  2           MiniPromoter  -0.772949852586751    NA
  3                  SCP_1 -0.0382368450302552    NA
  4           CMV_Promoter   -1.30547874517947    NA
  5    TNNT2_Promoter_Full    2.17330684394993    NA
  6 TNNT2_Promoter_Minimal   0.321689790304395    NA
                                             name       log2 label
  25       TNNT2_Promoter_Minimal_MYBPC3intronEnh -2.1637874    NA
  14                          ACTC1_Promoter_Full -1.8552410    NA
  15                       ACTC1_Promoter_Minimal -1.5105718    NA
  4                                  CMV_Promoter -1.3054787    NA
  36 TNNT2_Promoter_Micro3withPromTFBSinIntronEnh -0.8386609    NA
  2                                  MiniPromoter -0.7729499    NA
  'data.frame': 36 obs. of  3 variables:
   $ name : Factor w/ 37 levels "","ACTC1_Promoter_Full",..: 34 2 3 7 19 8 18 17 5 21 ...
   $ log2 : num  -2.164 -1.855 -1.511 -1.305 -0.839 ...
   $ label: Factor w/ 1 level "NA": 1 1 1 1 1 1 1 1 1 1 ...
  NULL
  png 
    2 
  png 
    2 

Command error:

  Attaching package: ‘dplyr’

  The following objects are masked from ‘package:stats’:

      filter, lag

  The following objects are masked from ‘package:base’:

      intersect, setdiff, setequal, union

Am I correct that it is upset that I only have one replicate, so it gets confused when it tries to calculate correlation across replicates? It seems to have produced all of the files in the output folder for the replicate, but not the various plots that would show all replicates if there were more than one.

More importantly, are there any critical steps toward the end of the pipeline that could affect my data, or can I just use what it gave me in the replicate folder?

Thanks again!!!

visze commented 2 years ago

Yes, I think so too. In the past we did not have data without replicates.

The two missing steps are not important. You should have everything you need. What is missing are the correlation plots (which cannot be computed because there are no replicates) and the master table, which is a table that combines counts, BCs and expression fold changes from all replicates.
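
To see in advance whether the correlation step has anything to compare, one can count the replicates per condition in the experiment file. This is only a sketch in Python and not how MPRAflow does it internally: the function names are invented for illustration, and it assumes the file has the `Condition` and `Replicate` columns discussed earlier in this thread.

```python
import csv
import io
from collections import defaultdict


def replicates_per_condition(csv_text):
    """Map each condition to its number of distinct replicates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    replicates = defaultdict(set)
    for row in reader:
        replicates[row["Condition"]].add(row["Replicate"])
    return {condition: len(reps) for condition, reps in replicates.items()}


def conditions_without_correlation(csv_text):
    """List conditions with fewer than two replicates.

    A correlation across replicates needs at least two of them.
    """
    counts = replicates_per_condition(csv_text)
    return sorted(cond for cond, n in counts.items() if n < 2)


# Hypothetical experiment file with one replicate, as in this issue;
# the other columns are left out to keep the example short.
single = "Condition,Replicate\n36CRS,1\n"
print(conditions_without_correlation(single))
```

Any condition returned here has a single replicate, so the per-replicate count files are the usable result and the correlation plots should not be expected for it.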