kharchenkolab / dropEst

Pipeline for initial analysis of droplet-based single-cell RNA-seq data
GNU General Public License v3.0

FillCollisionsAdjustmentInfo is taking too long to complete #51

Open k3yavi opened 6 years ago

k3yavi commented 6 years ago

Hi @VPetukhov , I am trying to use dropEst with one of the 10x datasets from here. I downloaded the BAM from the above link and used the dropest -f mode to generate the count matrix and the rds file for the downstream UMI correction. Unfortunately I am a noob in R and am just copy-pasting code from this tutorial. I have a couple of questions regarding the workings of the dropEst pipeline:

  1. When we provide an external BAM (like the one from the 10x website in my case), what type of UMI correction does the dropest command do before generating the count matrix? I am guessing it simply counts the tags from the BAM if I don't give either -m or -M?
  2. When I try to use the dropestr R package following the above tutorial, it seems to give a good UMI distribution up to here. But once I try to use the function
    collisions_info <- FillCollisionsAdjustmentInfo(umi_probabilities, max_umi_per_gene, step=20, mc.cores=5, verbose=T)

    the program gives the following error:

    > FillCollisionsAdjustmentInfo(umi_probabilities, max_umi_per_gene, step=20, mc.cores=5, verbose=T)
    Error in FillCollisionsAdjustmentInfo(umi_probabilities, max_umi_per_gene,  :
    unused arguments (step = 20, mc.cores = 5, verbose = T)

    So I removed step = 20, mc.cores = 5, verbose = T, but it has been 3 days now and the program still hasn't finished. (A way to check which arguments the installed version actually accepts is sketched below.)
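    A minimal sanity check, assuming only that dropestr is installed, is to inspect the signature of the installed function before calling it:

    # Base R introspection; nothing dropestr-specific is assumed here
    # beyond the function name used in the tutorial.
    library(dropestr)
    args(FillCollisionsAdjustmentInfo)
    names(formals(FillCollisionsAdjustmentInfo))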

What am I doing wrong? Thanks again for your help.

my config.xml

<config>
    <!-- droptag -->
    <TagsSearch>
        <protocol>10x</protocol>
        <BarcodesSearch>
            <barcode1_length>8</barcode1_length>
            <barcode2_length>16</barcode2_length>
            <umi_length>10</umi_length>
            <r1_rc_length>0</r1_rc_length>
        </BarcodesSearch>

        <Processing>
            <min_align_length>10</min_align_length>
            <reads_per_out_file>10000000</reads_per_out_file>
            <poly_a_tail>AAAAAAAA</poly_a_tail>
        </Processing>
    </TagsSearch>

    <!-- dropest -->
    <Estimation>
        <Merge>
            <barcodes_file>/mnt/scratch5/avi/dropest_data/barcodes.tsv</barcodes_file>
            <barcodes_type>const</barcodes_type>
            <min_merge_fraction>0.2</min_merge_fraction>
            <max_cb_merge_edit_distance>2</max_cb_merge_edit_distance>
            <max_umi_merge_edit_distance>1</max_umi_merge_edit_distance>
            <min_genes_after_merge>100</min_genes_after_merge>
            <min_genes_before_merge>20</min_genes_before_merge>
        </Merge>

        <PreciseMerge>
            <max_merge_prob>1e-5</max_merge_prob>
            <max_real_merge_prob>1e-7</max_real_merge_prob>
        </PreciseMerge>
    </Estimation>

    <BamTags> <!-- Optional. Tags, which are used to parse .bam file (-f option) or to print tagged .bam file (-b or -F options). Default values correspond to 10x protocol. -->
        <cb>CB</cb> <!-- Cell barcode. Default: CB. -->
        <cb_raw>CR</cb_raw> <!-- Cell barcode raw. Used only for bam output. Default: CR. -->
        <umi>UB</umi> <!-- UMI. Default: UB. -->
        <umi_raw>UR</umi_raw> <!-- UMI raw. Used only for bam output. Default: UR. -->
        <gene>GX</gene> <!-- Gene id. Default: GX. -->
        <cb_quality>CQ</cb_quality> <!-- Cell barcode quality. Default: CQ. -->
        <umi_quality>UQ</umi_quality> <!-- UMI quality. Default: UQ. -->
    </BamTags>
</config>
VPetukhov commented 6 years ago

Hi @k3yavi , Hmm, this function should run really fast. Can you please give me the values of max_umi_per_gene and length(umi_probabilities)?

k3yavi commented 6 years ago

Hi @VPetukhov , The experiment has the following numbers:

> max_umi_per_gene
[1] 4192617
> length(umi_probabilities)
[1] 1048576
VPetukhov commented 6 years ago

4 million UMIs per gene? That's definitely not a correct number. Can you please publish your dropEst log?
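For context: the config above sets umi_length to 10, and 10-nt UMIs allow only 4^10 distinct sequences, which is exactly length(umi_probabilities). A quick check in R shows why 4,192,617 UMIs for one gene cannot be right:

# 10-nt UMIs (per <umi_length>10</umi_length> in the config) span
# a space of 4^10 possible sequences.
4^10
#> [1] 1048576

# The reported per-gene maximum exceeds the entire UMI space,
# which suggests it is counting reads rather than unique UMIs.
4192617 > 4^10
#> [1] TRUE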

k3yavi commented 6 years ago

Unfortunately I can't find the log of the run, but I just used the BAM from the 10x website and the config attached above. Attaching the report.html file.

report.html.zip

k3yavi commented 5 years ago

Hi, I was wondering if there has been any luck resolving this issue?

VPetukhov commented 5 years ago

There were several problems with UMI correction, but it should work now. Please check the result I published for neurons_900. I fixed your config so that the pipeline reads non-corrected UMIs and their qualities.
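As a sketch of what that change plausibly looks like, assuming the fix was to point BamTags at the raw (non-corrected) tags documented in the config comments above:

<!-- Hypothetical sketch: read raw UMIs and their qualities from the
     10x BAM instead of the corrected UB tag. Tag names follow the
     BamTags comments in the config posted earlier. -->
<BamTags>
    <umi>UR</umi>                 <!-- raw UMI; UB is the corrected one -->
    <umi_quality>UQ</umi_quality> <!-- UMI base qualities -->
</BamTags>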

k3yavi commented 5 years ago

Thanks @VPetukhov for looking into it, I'll try running the tool again and report back.

k3yavi commented 5 years ago

@VPetukhov , With the latest version of dropEst, I get the following error.

/mnt/scratch5/avi/alevin/bin/dropest/dropEst/build/dropest -f -c est_config.dump.xml -C 1200 ./possorted_genome_bam.bam
Version: 0.8.5.
Run: 02/02/2019 16:27:07.
No such node (max_cb_merge_edit_distance)

I am using the same XML file that you shared for the neurons_900 dataset in https://github.com/hms-dbmi/dropEst/issues/68 and got the BAM from the 10x website.

The weirdest thing is that, with every run of dropEst, the XML file gets cleared out. I double-checked the content of the XML file before giving it as input to dropEst, but the program fails with the above error.

Thanks in advance for your help.

k3yavi commented 5 years ago

OK, I have a workaround: if I make the XML read-only, it seems to work fine, at least up to the generation of the rds file. Now checking dropestr for the UMI correction.
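Concretely, the workaround amounts to something like the following (reusing the command from above; the read-only bit is the only change):

# Make the config read-only so dropest cannot overwrite it,
# then rerun the same command as before.
chmod a-w est_config.dump.xml
dropest -f -c est_config.dump.xml -C 1200 ./possorted_genome_bam.bam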

k3yavi commented 5 years ago

I can use dropestr to generate UMI corrected counts now. Thanks for the help.

VPetukhov commented 5 years ago

The weirdest thing is that, with every run of dropEst, the XML file gets cleared out.

If you used the XML from the same folder and with the same name as in the archive, it's normal behavior. It's a dump of the config file, and the pipeline dumps it again, so you just need to move or rename it.
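In other words (filenames are illustrative), work from a copy of the dump rather than the dump itself:

# dropest re-dumps its config on every run, so copy the dump under
# a new name before feeding it back in.
cp est_config.dump.xml my_config.xml
dropest -f -c my_config.xml -C 1200 ./possorted_genome_bam.bam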

k3yavi commented 5 years ago

Ah, got it. Thanks for the clarification.

k3yavi commented 5 years ago

I think I closed this issue a little too early, without properly going through the logs; sorry for that. In my defense, the output does get generated, but with the following warning, which I missed earlier:

Warning message:
In parallel::mclapply(..., mc.cores = GetMcCores(mc.cores)) :
  all scheduled cores encountered errors in user code

I am not sure how this warning affects the pipeline and the output count matrix. I tried using only 1 core to avoid multithreading, and the program fails with the following error:

Correcting UMI sequence errors.
Estimating UMIs distribution... Completed.
Filling collisions info...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Completed.
Filling info about adjacent UMIs...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Done!
Correction info prepared!

Estimating prior error probabilities... Completed.
Correcting UMI sequence errors...Error in `$<-.data.frame`(`*tmp*`, "IsMerged", value = logical(0)) :
  replacement has 0 rows, data has 4
Calls: CorrectUmiSequenceErrors ... lapply -> FUN -> PredictBayesian -> $<- -> $<-.data.frame
Execution halted
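For what it's worth, the error message itself is a generic data.frame safeguard, reproducible in plain R without dropestr:

# Assigning a zero-length vector as a column of a non-empty
# data.frame raises exactly this error in base R.
df <- data.frame(x = 1:4)
df$IsMerged <- logical(0)
#> Error in `$<-.data.frame`(`*tmp*`, "IsMerged", value = logical(0)) :
#>   replacement has 0 rows, data has 4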

I also tried the step-by-step procedure instead of the quick one. In that case I get the following error:

> cm <- CorrectUmiSequenceErrors(reads_per_umi_per_cell, umi.probabilities=umi_probabilities, collisions.info=collision_info,
+                                correction.info=correction_info, mc.cores=1, verbosity.level=2)
Correcting UMI sequence errors.

Estimating prior error probabilities... Completed.
Correcting UMI sequence errors...Error in GetSmallerNeighboursDistributionsBySizes(dp.matrices, larger.nn,  :
  _Map_base::at
>
evanbiederstedt commented 4 years ago

Hey @k3yavi

If you could provide more information so this could be debugged, I would appreciate it.

Thanks, Evan