mckennalab / SingleCellLineage

Updated scripts and pipelines for processing GESTALT data at single-cell resolution
19 stars 8 forks source link

Function edge error and tree construction #3

Closed qinqian closed 3 years ago

qinqian commented 3 years ago

Hi Aaron,

Thank you for presenting such a wonderful software and experiment. We've got 93 datasets and 91 of them have been successfully processed by the SingleCellLineage. Only 2 samples failed at the step FunctionEdge, here are the logs:

Sample 1:

ERROR 04:31:02,045 FunctionEdge - Error: /usr/local/bin/flash --min-overlap 30 --max-mismatch-density 0.02 --output-directory=/app/data/NF_D1-3 /app/data/NF_D1-3/NF_D1-3.umi.fwd.fastq /app/data/NF_D1-3/NF_D1-3.umi.rev.fastq

[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     /app/data/NF_D1-3/NF_D1-3.umi.fwd.fastq
[FLASH]     /app/data/NF_D1-3/NF_D1-3.umi.rev.fastq
[FLASH]
[FLASH] Output files:
[FLASH]     /app/data/NF_D1-3/out.extendedFrags.fastq
[FLASH]     /app/data/NF_D1-3/out.notCombined_1.fastq
[FLASH]     /app/data/NF_D1-3/out.notCombined_2.fastq
[FLASH]     /app/data/NF_D1-3/out.hist
[FLASH]     /app/data/NF_D1-3/out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           30
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density:  0.020000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      4
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 4 combiner threads
[FLASH] Processed 25000 read pairs
[FLASH] Processed 50000 read pairs
[FLASH] Processed 51875 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      51875
[FLASH]     Combined pairs:   333
[FLASH]     Uncombined pairs: 51542
[FLASH]     Percent combined: 0.64%
[FLASH]
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (68.77%)
overlapped by more than 65 bp, the --max-overlap (-M) parameter.  Consider
increasing this parameter.  (As-is, FLASH is penalizing overlaps longer than
65 bp when considering them for possible combining!)
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] FLASH v1.2.11 complete!
[FLASH] 0.348 seconds elapsed
[FLASH] Finished with 1 warning (see above)

Sample 2:

ERROR 04:38:50,412 FunctionEdge - Error: /usr/local/bin/flash --min-overlap 30 --max-mismatch-density 0.02 --output-directory=/app/data/NF_D1-4 /app/data/NF_D1-4/NF_D1-4.umi.fwd.fastq /app/data/NF_D1-4/NF_D1-4.umi.rev.fastq
ERROR 04:38:50,417 FunctionEdge - Contents of /app/data/NF_D1-4/out.notCombined_1.fastq.out:
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     /app/data/NF_D1-4/NF_D1-4.umi.fwd.fastq
[FLASH]     /app/data/NF_D1-4/NF_D1-4.umi.rev.fastq
[FLASH]
[FLASH] Output files:
[FLASH]     /app/data/NF_D1-4/out.extendedFrags.fastq
[FLASH]     /app/data/NF_D1-4/out.notCombined_1.fastq
[FLASH]     /app/data/NF_D1-4/out.notCombined_2.fastq
[FLASH]     /app/data/NF_D1-4/out.hist
[FLASH]     /app/data/NF_D1-4/out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           30
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density:  0.020000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      4
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 4 combiner threads
[FLASH] Processed 25000 read pairs
[FLASH] Processed 50000 read pairs
[FLASH] Processed 62353 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      62353
[FLASH]     Combined pairs:   7537
[FLASH]     Uncombined pairs: 54816
[FLASH]     Percent combined: 12.09%
[FLASH]
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (96.90%)
overlapped by more than 65 bp, the --max-overlap (-M) parameter.  Consider
increasing this parameter.  (As-is, FLASH is penalizing overlaps longer than
65 bp when considering them for possible combining!)
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 0.404 seconds elapsed
[FLASH] Finished with 1 warning (see above)

Do you have an idea about how to fix the issue? Or is it because of bad sequence quality?

In addition, the current pipeline uses mix to construct a phylogenetic tree for a single sample? Is there a function to construct a single tree from multiple samples originated from the same experimental condition, and compare trees across different experimental conditions? Thanks in advance!

Best, Qian

aaronmck commented 3 years ago

Hi Qian,

Interesting, we haven't seen that before, though it's probably driven by very large edits (and the resulting large dropouts). I can push up a parameter to allow for larger overlaps between reads which should fix this. Alternatively, we could ignore warnings from Flash, though sometimes it good to see these things? Let me know what you think.

-Aaron

qinqian commented 3 years ago

Hi Aaron,

Thank you for your feedback!

For these two datasets, the Flash outputs an error together with warnings, after that, the pipeline fails to proceed, thus no .stats file will be generated. I think the parameter updates to allow for larger overlap sound good, could you please push a parameter? I could experiment from my side to see if .stats file could be generated or not, or we could discuss to share the data with you later if it is good to reproduce the issue from your side.

Best wishes, Qian

qinqian commented 3 years ago

Hi Aaron,

Happy New Year! Another quick question is the output of the current pipeline didn't include .umiCounts as mentioned in the README, could you please kindly take a look when it is convenient for you?

dome_4_1X.umiCounts information about each UMIs that we saw and if there were enough reads for an individual UMI to call it 'successfully captured'. You should check this file to see if there are any problems with the UMI collapsing step.

Best, Qian

aaronmck commented 3 years ago

Hi Qian,

I've put the flashMaxOverlap parameter into the pipeline and tested that's propagated to the flash program, though I don't have the data to test your exact issue, so please let me know if this doesn't fix it. I've also put some other bug fixes into the latest docker release and updated the main page to point that version instead of stable.

The umiCounts is there in both the previous and current versions, are you looking in the _/app/data/dome_41X/ folder? Let me know if it's not there.

Best, Aaron

qinqian commented 3 years ago

Hi Aaron,

Thank you so much for your help! I'll test the new pipeline soon.

Yes, the umiCounts is not there in the /app/data/output folder for previous version, I'll check if it is there for the current version. Thanks!

Best, Qian

qinqian commented 3 years ago

Hi Aaron,

The new version works perfectly for all our 93 samples. Thank you for your help!

Best, Qian

qinqian commented 3 years ago

Hi Aaron,

I found that there are significant differences in the result .stat files between the two versions. For example, one of the 93 samples has 3,951 cells in the latest version, but it was 1,760 cells in the previous version. The top alleles constitution varies in cell number as well. Does the change of FLASH parameters increase the cell and allele number called? Which version is more reliable?

Best, Qian

aaronmck commented 3 years ago

Hi Qian,

Sorry for the delay here, it's recruitment week. Are you running the pipelines from the Docker install or from your own installation? The Docker update went from stable to latest, and incorporated a number of updates. If this is just the pipeline script there shouldn't have been that big of an increase unless a lot of reads were getting dropped by the previous FLASH setup. Let me know, I'm happy to dig in a bit.

qinqian commented 3 years ago

Hi Aaron,

It's fine, good luck with the recruitment.

Yes, I am running both the pipelines from the Docker install. The docker images I have compared are aaronmck/single_cell_gestalt:SC_GESTALT and aaronmck/single_cell_gestalt:latest, the results are very different, the top 10 alleles have only 4 overlaps. Is the SC_GESTALT image the same as the stable version?

Best wishes, Qian

aaronmck commented 3 years ago

Hi Qian,

That makes sense; the SC_GESTALT tag was for Bushra's paper, and we've made a number of updates in the meantime. The largest change is to be more permissive of mismatches in the primer region, as we were finding some of the single-cell RNA-seq data sets to be a little more error-prone. This is probably the biggest reason I can think of that you'd get more UMIs passing through, though I'm happy to take a look if you have specific examples (feel free to email me if you'd like it to be more private).

qinqian commented 3 years ago

Hi Aaron,

Thank you for the explanation.

I will ask my PI about sharing the datasets with you by email later, is this your official email info@mckennalab.org?

Best wishes, Qian

aaronmck commented 3 years ago

Yeah I'll see things sent to that address. Best, Aaron

qinqian commented 3 years ago

Got it. Thanks!

Best, Qian