lilab-bcb / cumulus

Cloud-based scalable and efficient single-cell genomics workflows
https://cumulus.readthedocs.io
BSD 3-Clause "New" or "Revised" License

CommandException: No URLs matched #133

Closed: slowkow closed this issue 3 years ago

slowkow commented 3 years ago

Problem

I ran cellranger_workflow version 14 with Cell Ranger version 4.0.0.

Here's the error I get:

CommandException: No URLs matched: gs://fc-secure-abc-123/steve/project/2020-12-22/data_output/ABC123_fastqs/fastq_path/H2YNLBGXH/project_mgh_3_gex
CommandException: 1 file/object could not be transferred.

Here's the failed job, in case you want to have a look for yourself: https://job-manager.dsde-prod.broadinstitute.org/jobs/95552f84-77fd-4174-8561-99afb9ea9771

This error occurs because cellranger_count is looking for a folder that is called {sample_id}, in this case "project_mgh_3_gex".

But the cellranger_mkfastq step does not create a folder named {sample_id}. Is it possible that cellranger mkfastq skipped the {sample_id} folder for some reason? For example, could it be because the sample sheet had only 1 sample?

I can see that cellranger_mkfastq created FASTQ files like {sample_id}_S*.fastq.gz without making a folder called {sample_id}.
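To tell the two layouts apart, a listing of object paths (for example from gsutil ls) could be classified with a small helper. This is only a sketch under assumptions: the function name fastq_layout and the example file names are hypothetical and are not part of the workflow.

```python
import re
from typing import List


def fastq_layout(paths: List[str], sample_id: str) -> str:
    """Classify where a sample's FASTQ files sit: 'nested', 'flat', or 'missing'.

    'nested' means .../{sample_id}/{sample_id}_S*.fastq.gz exists.
    'flat' means .../{sample_id}_S*.fastq.gz exists without the folder.
    """
    sid = re.escape(sample_id)
    nested = re.compile(rf"(^|/){sid}/{sid}_S\d+_.*\.fastq\.gz$")
    flat = re.compile(rf"(^|/){sid}_S\d+_.*\.fastq\.gz$")
    # Check 'nested' first, because a nested path also ends like a flat one.
    if any(nested.search(p) for p in paths):
        return "nested"
    if any(flat.search(p) for p in paths):
        return "flat"
    return "missing"


# Hypothetical listing that mirrors the situation in this issue.
listing = ["fastq_path/H2YNLBGXH/project_mgh_3_gex_S1_L001_R1_001.fastq.gz"]
print(fastq_layout(listing, "project_mgh_3_gex"))
```

A result of "flat" would correspond to the case reported here, where the files exist but the folder that cellranger_count expects does not.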

Here is the code where cellranger_count is looking for the folder {sample_id}:

https://github.com/klarman-cell-observatory/cumulus/blob/06968583beabbc25684e51fa74761b90593a3028/workflows/cellranger/cellranger_count.wdl#L107-L112

Solution

Maybe this code should be changed? It seems that this code is checking for the situation where we have {sample_id}_S*.fastq.gz filenames, and then it makes a new folder called {sample_id} and moves the FASTQ files into it.

https://github.com/klarman-cell-observatory/cumulus/blob/06968583beabbc25684e51fa74761b90593a3028/workflows/cellranger/cellranger_mkfastq.wdl#L109-L121

slowkow commented 3 years ago

Is this issue related to the "Dual Index"?

https://support.10xgenomics.com/single-cell-gene-expression/sequencing/doc/specifications-sample-index-sets-for-single-cell-3

This particular job has a sample sheet with two values for the column Index:

SI-TT-A1
SI-NT-A1

slowkow commented 3 years ago

I would suggest changing this special if-statement:

https://github.com/klarman-cell-observatory/cumulus/blob/06968583beabbc25684e51fa74761b90593a3028/workflows/cellranger/cellranger_mkfastq.wdl#L109-L121

Right now it is looking for samples where Index was not set. We infer that Index is not set when it is missing the - character.

Instead, the code should look for FASTQ files that are not in a {sample_id} folder. If files are found, then a new {sample_id} folder should be created and the files moved there.

In other words, the code inside the if-statement should always be running, regardless of whether or not Index contains the - character.
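The difference between the two conditions can be sketched in Python. This is not the WDL code from the repository: index_looks_unset only mirrors the heuristic as described above, and files_needing_folder, along with the example paths, is a hypothetical illustration of the proposed check.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def index_looks_unset(index):
    """Heuristic described above: Index counts as 'not set' if it has no '-'."""
    return "-" not in index


def files_needing_folder(paths, sample_ids):
    """Proposed check: find FASTQ files that are not inside a {sample_id} folder.

    Returns a dict mapping sample_id to the paths that would need to be moved.
    """
    found = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        for sample_id in sample_ids:
            pattern = rf"^{re.escape(sample_id)}_S\d+_.*\.fastq\.gz$"
            if re.match(pattern, p.name) and p.parent.name != sample_id:
                found[sample_id].append(path)
    return dict(found)


# Both Index values from this job contain '-', so the heuristic never fires.
print(index_looks_unset("SI-TT-A1"), index_looks_unset("SI-NT-A1"))

# The file-based check still finds the file that lacks a folder (hypothetical path).
paths = ["fastq_path/H2YNLBGXH/project_mgh_3_gex_S1_L001_R1_001.fastq.gz"]
print(files_needing_folder(paths, ["project_mgh_3_gex"]))
```

The point of the sketch is that the decision depends on where the files actually are, not on how the Index value is spelled.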

majorkazer commented 3 years ago

Hello,

I also got a similar exception after FASTQ generation in the call-collect-summaries step, where it seems that the pipeline is not creating the sample folder (as suggested above?).

CommandException: No URLs matched: gs://fc-secure-a6aa0703-c003-48d2-bfaf-557d70adefb1/201230_NB501935_0898_AH7H7HBGXF/cellranger_output/nasal_mucosa_test2_rna/metrics_summary.csv

I am running this on data from a NextSeq. Interestingly, when I submitted using a sample sheet that specified only 1 for Lane for this sample and its associated antibody hash, the pipeline didn't throw any error. Now that I am trying to run it with Lane set to 1-4 or *, I am getting this error.

It's clear that the folder and this file do not exist. The pipeline only spits out FASTQs that are separated by sample and lane.

Thanks for any insight!

slowkow commented 3 years ago

It sounds like you also have the situation where one step looks for a folder that was not created by the previous step.

To resolve this particular issue, I manually created the missing folders with the gsutil tool and moved the files myself. Then the rest of the workflow continued without errors. For reference, this was the error:

CommandException: No URLs matched: gs://fc-abc-123/steve/project/2020-12-22/output/201222_NB551582_0040_AH2YNLBGXH_fastqs/fastq_path/H2YNLBGXH/project_mgh_3_gex
CommandException: 1 file/object could not be transferred.

In this case I had files like this:

fastq_path/H2YNLBGXH/{sample_id}_S*_L*_*_001.fastq.gz

But cellranger_count was looking for:

fastq_path/H2YNLBGXH/{sample_id}/{sample_id}_S*_L*_*_001.fastq.gz
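A manual workaround like this can be planned before anything is moved. The sketch below only builds gsutil mv command strings and does not run them. The function name plan_moves, the bucket name, and the file names are hypothetical, and the read-type pattern (R1, R2, I1, I2) is an assumption about what the third wildcard stands for.

```python
import re
import shlex


def plan_moves(flowcell_dir, file_names, sample_id):
    """Build 'gsutil mv' commands that move flat FASTQ files into {sample_id}/.

    flowcell_dir: gs:// prefix of the flowcell directory.
    file_names: object names found directly under flowcell_dir.
    """
    # Matches {sample_id}_S*_L*_*_001.fastq.gz, assuming R/I read types.
    pattern = re.compile(
        rf"^{re.escape(sample_id)}_S\d+_L\d+_[RI]\d_001\.fastq\.gz$"
    )
    base = flowcell_dir.rstrip("/")
    commands = []
    for name in sorted(file_names):
        if pattern.match(name):
            src = f"{base}/{name}"
            dst = f"{base}/{sample_id}/{name}"
            commands.append(f"gsutil mv {shlex.quote(src)} {shlex.quote(dst)}")
    return commands


# Hypothetical bucket and file names, for illustration only.
names = [
    "project_mgh_3_gex_S1_L001_R1_001.fastq.gz",
    "project_mgh_3_gex_S1_L001_R2_001.fastq.gz",
    "Undetermined_S0_L001_R1_001.fastq.gz",
]
for cmd in plan_moves(
    "gs://example-bucket/fastq_path/H2YNLBGXH", names, "project_mgh_3_gex"
):
    print(cmd)
```

Printing the commands first makes it possible to review them before running anything against the bucket.
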

majorkazer commented 3 years ago

Maybe a separate issue, but is the workflow built to collapse samples split across lanes (and thus in separate FASTQs) during the cellranger_count or call-collect-summaries step? Or maybe this problem could be avoided if I demultiplexed separately to have only one FASTQ with the data from all lanes merged?

bli25 commented 3 years ago

Hi @slowkow @majorkazer,

I can confirm the issue was caused by dual index. We have fixed this issue in the newly released Cumulus 1.2.0 (https://cumulus.readthedocs.io/en/latest/index.html). Please give it a try!