mckennalab / SingleCellLineage

Updated scripts and pipelines for processing GESTALT data at single-cell resolution

Error in 10X Example: 'umiMemLimit' isn't defined #4

Closed: oligomyeggo closed this issue 3 years ago

oligomyeggo commented 3 years ago

Hello! Thanks for this great analysis pipeline. I am looking forward to using it in an upcoming 10X experiment, and was working through your provided 10X example when I ran into this error:

root@e71497f2cc8e:/app/my_test_run# bash test_run.sh
INFO  23:21:09,495 QScriptManager - Compiling 1 QScript 
INFO  23:21:14,152 QScriptManager - Compilation complete 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.utils.commandline.InvalidArgumentException: 
Argument with name 'umiMemLimit' isn't defined.
    at org.broadinstitute.gatk.utils.commandline.ParsingEngine.validate(ParsingEngine.java:306)
    at org.broadinstitute.gatk.utils.commandline.ParsingEngine.validate(ParsingEngine.java:279)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:216)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-ge91472d):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Argument with name 'umiMemLimit' isn't defined.
##### ERROR ------------------------------------------------------------------------------------------
INFO  23:21:14,285 QCommandLine - Shutting down jobs. Please wait... 
root@e71497f2cc8e:/app/my_test_run# ls -l

I see it's being defined here: https://github.com/mckennalab/SingleCellLineage/blob/6645b0dbc100970bce4ce3c41885830be72d6441/pipelines/CRISPR_analysis_PE_V2.scala#L128 and is set to 4 in the provided test_run.sh script, so I am not sure what's going wrong.

Any insight as to what is going on here? Thanks in advance!

aaronmck commented 3 years ago

Hi Caitlin,

Sorry for the slow reply here (and great GitHub username, btw). I haven't seen this before, but it could well be some out-of-date material in the 10X setup script. Are you using the stock test run script, or have you altered it for your data? I'll also make sure I can run the example on a clean docker pull. Thanks!

oligomyeggo commented 3 years ago

Hi Aaron,

Yes, I am using the stock test run script and just trying to get everything up and running on my computer before tweaking anything/trying it with my own data. After the initial attempt failed, I did try poking around a little and deleted the umiMemLimit argument from the test_run.sh script just to see what happened and got the following error:

root@e71497f2cc8e:/app/my_test_run# bash test_run.sh
INFO  17:43:31,841 QScriptManager - Compiling 1 QScript 
INFO  17:43:38,206 QScriptManager - Compilation complete 
INFO  17:43:38,373 HelpFormatter - ---------------------------------------------------------------------- 
INFO  17:43:38,373 HelpFormatter - Queue v3.5-0-ge91472d, Compiled 2015/12/21 04:10:16 
INFO  17:43:38,374 HelpFormatter - Copyright (c) 2012 The Broad Institute 
INFO  17:43:38,374 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  17:43:38,376 HelpFormatter - Program Args: -S /app/sc_GESTALT/pipelines/CRISPR_analysis_PE_V2.scala -i /app/my_test_run/data/tol2_simulated_data_tear_sheet.txt --aggLocation /app/my_test_run/data/pipeline_output/ --expName my_test_data --eda /app/EDNAFULL.Ns_are_zero -run --dontTrim --primersToUse FORWARD --umiIndex 10X -s /app/sc_GESTALT/scripts/ -b /app/bin/ --dontweb --scala /usr/bin/scala -nocompdaemon --minimumUMIReads 4 --minimumSurvivingUMIReads 3 --umiLength 28 
INFO  17:43:38,377 HelpFormatter - Executing as root@e71497f2cc8e on Linux 4.19.121-linuxkit amd64; OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09. 
INFO  17:43:38,377 HelpFormatter - Date/Time: 2021/01/06 17:43:38 
INFO  17:43:38,378 HelpFormatter - ---------------------------------------------------------------------- 
INFO  17:43:38,379 HelpFormatter - ---------------------------------------------------------------------- 
INFO  17:43:38,395 QCommandLine - Scripting DNAQC 
INFO  17:43:38,434 QCommandLine - Done with errors 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.utils.exceptions.UserException$CannotExecuteQScript: Unable to execute QScript: DNAQC.script() threw the following exception: java.lang.IllegalStateException: Unknown UMI index: 10X
    at org.broadinstitute.gatk.queue.QCommandLine$$anonfun$execute$5.apply(QCommandLine.scala:158)
    at org.broadinstitute.gatk.queue.QCommandLine$$anonfun$execute$5.apply(QCommandLine.scala:146)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:146)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
Caused by: java.lang.IllegalStateException: Unknown UMI index: 10X
    at org.broadinstitute.gatk.queue.qscripts.DNAQC$$anonfun$script$1.apply(CRISPR_analysis_PE_V2.scala:410)
    at org.broadinstitute.gatk.queue.qscripts.DNAQC$$anonfun$script$1.apply(CRISPR_analysis_PE_V2.scala:260)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.broadinstitute.gatk.queue.qscripts.DNAQC.script(CRISPR_analysis_PE_V2.scala:260)
    at org.broadinstitute.gatk.queue.QCommandLine$$anonfun$execute$5.apply(QCommandLine.scala:155)
    ... 10 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-ge91472d):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Unable to execute QScript: DNAQC.script() threw the following exception: java.lang.IllegalStateException: Unknown UMI index: 10X
##### ERROR ------------------------------------------------------------------------------------------
INFO  17:43:38,451 QCommandLine - Shutting down jobs. Please wait...

No idea if that is helpful or clarifies anything further. Thanks so much for taking a look into this!

aaronmck commented 3 years ago

Sorry! The stable Docker container is a little out of date relative to the code, which can be confusing. I've updated the latest container and changed the links on the main page to point to that container instead. I've also pushed some other bug fixes that should be in there, and tested that the umiMemLimit argument seems to work. Let me know if this fixes things, and thanks for trying it out!
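For anyone hitting a similar "isn't defined" error, one way to spot a mismatch between a run script and the pipeline script shipped in a container is to compare the flags passed on the command line with the arguments the QScript declares. The sketch below is only an illustration: it assumes the QScript declares its arguments with `fullName = "..."` inside `@Argument`/`@Input` annotations, and the helper names are hypothetical. Flags consumed by Queue itself may also be reported, so treat the result as a list of candidates to check.

```python
import re

# Assumption: QScript arguments are declared as fullName = "name" in annotations.
FULLNAME_RE = re.compile(r'fullName\s*=\s*"([^"]+)"')
# Long flags of the form --name, preceded by whitespace or start of text.
FLAG_RE = re.compile(r'(?<!\S)--([A-Za-z]\w*)')


def declared_arguments(qscript_text):
    """Return the set of argument names declared in the QScript source."""
    return set(FULLNAME_RE.findall(qscript_text))


def long_flags(command_text):
    """Return the set of --long flags used in a command line or run script."""
    return set(FLAG_RE.findall(command_text))


def undeclared_flags(qscript_text, command_text):
    """Flags used on the command line that the QScript does not declare."""
    return sorted(long_flags(command_text) - declared_arguments(qscript_text))


if __name__ == "__main__":
    # Inline stand-ins for the real files, for illustration only.
    qscript = '@Argument(doc = "UMI length", fullName = "umiLength", required = false)'
    command = "java -jar Queue.jar -S pipeline.scala --umiLength 28 --umiMemLimit 4 -run"
    print(undeclared_flags(qscript, command))
```

Inside the container, the two inputs would be the text of `/app/sc_GESTALT/pipelines/CRISPR_analysis_PE_V2.scala` and of `test_run.sh`, read with `pathlib.Path(...).read_text()`.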

oligomyeggo commented 3 years ago

Yep, it's working now! Thank you!! As a follow-up question, just to make sure that I understand the overall process with this pipeline: with inDrops data you have to pre-process the scGESTALT data prior to running the pipeline so that you split a fastq file into individual fastq files for each cell, but with 10X data you don't have to split the fastq files?

And then you should be able to take the .stats output file and merge that with the 10X transcriptome data using a modified version of the MatchPipeFunc and Transcriptome-scGestaltMatchPipe files provided in the Raj et al., 2018 methods (I am assuming they would need to be modified to handle 10X data vs inDrops data; I haven't had a chance to dig into them too much yet)?

aaronmck commented 3 years ago

You shouldn't have to preprocess individual samples into cells ahead of time (for either 10X or inDrops). If you have one set of fastq files for many samples, you use the built-in barcode splitter by specifying the same fastq files for each sample, with their barcode sequences in the appropriate barcode column. Otherwise, just list one unique set of fastq files for each sample and leave the last two columns as ALL.
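As a rough illustration of those two layouts, here is a small sketch that writes tear sheet rows either with per-sample barcodes (shared fastq files) or with ALL in the last two columns (one fastq set per sample). The column names, the tab delimiter, and the function names are all hypothetical placeholders; the example tear sheet in the repository is the reference for the real header and column order.

```python
import csv
import io

# HYPOTHETICAL column names; take the real header from the example tear sheet.
COLUMNS = ["sample", "fastq1", "fastq2", "barcode1", "barcode2"]


def tear_sheet_rows(samples):
    """Build one row per sample.

    Each sample is a dict with 'sample', 'fastq1', 'fastq2' and an optional
    'barcodes' pair. Samples without barcodes get ALL in the last two columns.
    """
    rows = []
    for s in samples:
        barcode1, barcode2 = s.get("barcodes", ("ALL", "ALL"))
        rows.append([s["sample"], s["fastq1"], s["fastq2"], barcode1, barcode2])
    return rows


def write_tear_sheet(samples, handle):
    """Write the header and rows as tab-delimited text (assumed delimiter)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(tear_sheet_rows(samples))


if __name__ == "__main__":
    samples = [
        # Two samples pooled in the same fastq files, split by barcode.
        {"sample": "fishA", "fastq1": "pool_R1.fq.gz", "fastq2": "pool_R2.fq.gz",
         "barcodes": ("ACGTAC", "TTGACA")},
        {"sample": "fishB", "fastq1": "pool_R1.fq.gz", "fastq2": "pool_R2.fq.gz",
         "barcodes": ("GGATCC", "CATGCA")},
        # A sample with its own fastq files: no splitting needed.
        {"sample": "fishC", "fastq1": "fishC_R1.fq.gz", "fastq2": "fishC_R2.fq.gz"},
    ]
    out = io.StringIO()
    write_tear_sheet(samples, out)
    print(out.getvalue())
```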

You're right about the stats file. When the pipeline is done, you'll have the resulting cell ID and UMI encoded into the read name (the first column) of the stats file, which you'll then have to parse out and map to the cell IDs from your transcriptional data. I haven't tried Bushra's scripts, but the process shouldn't be too bad, and if you get stuck at that point let me know. Good luck!
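A minimal sketch of that parsing step, under loudly stated assumptions: the stats file is taken to be tab-delimited with a header line, and the read name is taken to end in `_<cell barcode>_<UMI>`. That read-name layout is hypothetical, so inspect a few rows of your own stats file and adjust the regular expression before relying on it. The function names are also just for illustration.

```python
import csv
import io
import re

# HYPOTHETICAL read-name layout: <anything>_<cell barcode>_<UMI>.
# Check real read names in the stats file and adapt this pattern.
READ_NAME_RE = re.compile(r"_(?P<cell>[ACGTN]+)_(?P<umi>[ACGTN]+)$")


def cell_and_umi(read_name, pattern=READ_NAME_RE):
    """Return (cell barcode, UMI) parsed from a read name, or None if no match."""
    match = pattern.search(read_name)
    if match is None:
        return None
    return match.group("cell"), match.group("umi")


def index_stats_by_cell(stats_handle, known_cells, pattern=READ_NAME_RE):
    """Group stats rows by cell barcode, keeping only cells seen in the transcriptome.

    Assumes tab-delimited input with one header line and the read name in
    the first column.
    """
    reader = csv.reader(stats_handle, delimiter="\t")
    next(reader, None)  # skip the header line
    by_cell = {}
    for row in reader:
        if not row:
            continue
        parsed = cell_and_umi(row[0], pattern)
        if parsed is None:
            continue
        cell, umi = parsed
        if cell in known_cells:
            by_cell.setdefault(cell, []).append((umi, row))
    return by_cell


if __name__ == "__main__":
    # Cell Ranger barcodes usually carry a "-1" style suffix; strip it first.
    transcriptome_barcodes = ["ACGTACGT-1", "TTGACATG-1"]
    known = {barcode.split("-")[0] for barcode in transcriptome_barcodes}
    stats_text = "readName\tkeep\nr1_ACGTACGT_AACCGG\tPASS\nr2_GGGGCCCC_TTTTAA\tPASS\n"
    print(index_stats_by_cell(io.StringIO(stats_text), known))
```

Passing the pattern as a parameter keeps the grouping logic unchanged once the real read-name layout is known.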