Closed fmalmeida closed 3 months ago
black
) is failingTo keep the code consistent with lots of contributors, we run automated code consistency checks. To fix this CI test, please run:
black
: pip install black
black .
Once you push these changes the test should pass, and you can hide this comment :+1:
We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!
Thanks again for your contribution!
nf-core lint
overall result: Passed :white_check_mark: :warning:Posted for pipeline commit 2406600
+| ✅ 171 tests passed |+
#| ❔ 4 tests were ignored |#
!| ❗ 3 tests had warnings |!
Hi @grst and @apeltzer , This PR is now ready for review and discussions. As promised, I am going to add a description of changes and open things.
First of all, I have added a script to perform the empty-drops calling and filtering using a library that is available in bioconda, bioconductor-dropletutils
.
With that, I added a module which is simple, it takes a matrix file, and performs the empty drops call on it, generating another matrix file.
Note: The module fails on small datasets or whenever there is not sufficient data points for filtering. Thus, it must work on "raw/unprocessed" matrices and should be skipped (
--skip_emptydrops=true
) for small datasets.
With the module generated, I could then include it in the workflow.
// Run emptydrops calling module
if ( !params.skip_emptydrops ) {
//
// emptydrops should only run on the raw matrices thus, filter-out the filtered result of the aligners that can produce it
//
if ( params.aligner in [ 'cellranger', 'cellrangerarc', 'kallisto', 'star' ] ) {
ch_mtx_matrices_for_emptydrops =
ch_mtx_matrices.filter { meta, mtx_files ->
mtx_files.toString().contains("raw_feature_bc_matrix") || // cellranger
mtx_files.toString().contains("counts_unfiltered") || // kallisto
mtx_files.toString().contains("raw") // star
}
} else {
ch_mtx_matrices_for_emptydrops = ch_mtx_matrices
}
EMPTYDROPS_CELL_CALLING( ch_mtx_matrices_for_emptydrops )
ch_mtx_matrices = ch_mtx_matrices.mix( EMPTYDROPS_CELL_CALLING.out.filtered_matrices )
}
One thing to note above is that, as discussed previously, I have to add a checker/filter in order to only pass on the raw/unprocessed matrices generated by the assemblers, because, I think it does not make sense to run the module in the already filtered/processed matrices.
Or do you think we should invert, only running in the processed ones?
Because now we will have both the data directly from the aligners, and a custom-made filtering module, I had to change the conversion modules a bit so they are aware of that, and can understand the difference between raw/filtered from the aligners itself and what is the custom empty drops filter.
With that, to try to avoid confusion by the user, I had to add such "suffixes" to the generated converted files, so now we write data with such sufixes:
*_{raw,filtered,custom_emptydrops_filter}_matrix.{h5ad,rds}
In this case, the meanings are: | suffix | meaning |
---|---|---|
raw | Conversion of the raw/unprocessed matrix generated by the tool. It is also used for tools that generate only one matrix, such as alevin. | |
filtered | Conversion of the filtered/processed matrix generated by the tool | |
custom_emptydrops_filter custom_emptydrops_filter | Conversion of the matrix that was generated by the new custom empty drops filter module |
I also had to update the two conversion-scripts for rds and h5ad, because now they also have to understand this new matrix generated, and to allow it to also transpose the matrices from cellranger, because, when we do the normal conversion of cellranger matrices, we use the .h5
files it generates, so it is very fine, but the cellranger-emptydrops filtering does not produce such file and then the conversion scripts need to understand it.
Finally, it has to be discussed the changes made in the aligners-connections to integrate this module. One by one.
Because alevin only produce on matrix result, without producing a pair (raw&filtered) as others like cellranger, star and kallisto, the integration was seamless and required no major change, just connecting it to the module.
When using it, the results generated are the following:
alevin_run/
├── alevin
│ ├── mtx_conversions
│ │ ├── combined_custom_emptydrops_filter_matrix.h5ad
│ │ ├── combined_raw_matrix.h5ad
│ │ ├── pbmc8k
│ │ │ ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│ │ │ ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│ │ │ ├── pbmc8k_raw_matrix.h5ad
│ │ │ └── pbmc8k_raw_matrix.rds
│ │ └── versions.yml
│ ├── pbmc8k
│ │ └── emptydrops_filtered
│ │ ├── quants_mat_cols.txt
│ │ ├── quants_mat.mtx
│ │ └── quants_mat_rows.txt
│ ├── pbmc8k_alevin_results
│ │ ├── af_map
│ │ │ ├── alevin
│ │ │ ├── aux_info
│ │ │ ├── cmd_info.json
│ │ │ ├── libParams
│ │ │ ├── logs
│ │ │ ├── map.rad
│ │ │ └── unmapped_bc_count.bin
│ │ ├── af_quant
│ │ │ ├── alevin
│ │ │ ├── all_freq.bin
│ │ │ ├── collate.json
│ │ │ ├── featureDump.txt
│ │ │ ├── generate_permit_list.json
│ │ │ ├── map.collated.rad
│ │ │ ├── permit_freq.bin
│ │ │ ├── permit_map.bin
│ │ │ ├── quant.json
│ │ │ └── unmapped_bc_count_collated.bin
│ │ └── simpleaf_quant_log.json
├── alevinqc
├── fastqc
├── multiqc
└── pipeline_info
For kallisto, I first had to include a new parameter in the pipeline, called, --kb_filter
because it is possible that kallisto produce only one matrix or a pair (raw/filtered) depending on whether the bustools --filter
command is used or not. So, I added this new parameter to make sure I could include the module taking account of both scenarios.
Finally, to make sure I had a channel that could be properly used, and of course, that we could filter raw / filtered data to choose what to pass on to the empty drops filter module, I had to modify the generated channels by the module (see here).
tuple val(meta), path ("*.count") , emit: count
tuple val(meta), path ("*.count/counts_unfiltered"), emit: raw_counts // TODO: Add to nf-coew/modules before merging PR
tuple val(meta), path ("*.count/counts_filtered") , emit: filtered_counts, optional: true // TODO: Add to nf-coew/modules before merging PR
Then, of course, I updated the downstream snippets of the codes in the suf-workflows and workflow to understand it.
I know these changes shall got to nf-core/modules first. I added here first for discussion, when all solved, I can add to nf-core/modules whatever is needed.
Results look like this, when --filter
is on.
kallisto_lamanno_run
├── fastqc
├── kallisto
│ ├── mtx_conversions
│ │ ├── combined_custom_emptydrops_filter_matrix.h5ad
│ │ ├── combined_filtered_matrix.h5ad
│ │ ├── combined_raw_matrix.h5ad
│ │ ├── pbmc8k
│ │ │ ├── pbmc8k_spliced_matrix.h5ad
│ │ │ ├── pbmc8k_spliced_matrix.rds
│ │ │ ├── pbmc8k_unspliced_matrix.h5ad
│ │ │ └── pbmc8k_unspliced_matrix.rds
│ │ └── versions.yml
│ ├── pbmc8k.count
│ │ ├── 10x_version2_whitelist.txt
│ │ ├── counts_filtered
│ │ │ ├── spliced.barcodes.txt
│ │ │ ├── spliced.genes.txt
│ │ │ ├── spliced.mtx
│ │ │ ├── unspliced.barcodes.txt
│ │ │ ├── unspliced.genes.txt
│ │ │ └── unspliced.mtx
│ │ ├── counts_unfiltered
│ │ │ ├── spliced.barcodes.txt
│ │ │ ├── spliced.genes.txt
│ │ │ ├── spliced.mtx
│ │ │ ├── unspliced.barcodes.txt
│ │ │ ├── unspliced.genes.txt
│ │ │ └── unspliced.mtx
│ │ ├── emptydrops_filtered
│ │ │ ├── spliced.barcodes.txt
│ │ │ ├── spliced.genes.txt
│ │ │ ├── spliced.mtx
│ │ │ ├── unspliced.barcodes.txt
│ │ │ ├── unspliced.genes.txt
│ │ │ └── unspliced.mtx
│ │ ├── filter_barcodes.txt
│ │ ├── inspect.json
│ │ ├── inspect.spliced.json
│ │ ├── inspect.unspliced.json
│ │ ├── kb_info.json
│ │ ├── matrix.ec
│ │ ├── output.bus
│ │ ├── output.filtered.bus
│ │ ├── output.unfiltered.bus
│ │ ├── run_info.json
│ │ ├── spliced.filtered.bus
│ │ ├── spliced.unfiltered.bus
│ │ ├── transcripts.txt
│ │ ├── unspliced.filtered.bus
│ │ └── unspliced.unfiltered.bus
│ └── versions.yml
├── multiqc
└── pipeline_info
kallisto_run
├── fastqc
├── kallisto
│ ├── mtx_conversions
│ │ ├── combined_custom_emptydrops_filter_matrix.h5ad
│ │ ├── combined_raw_matrix.h5ad
│ │ ├── pbmc8k
│ │ │ ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│ │ │ ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│ │ │ ├── pbmc8k_raw_matrix.h5ad
│ │ │ └── pbmc8k_raw_matrix.rds
│ │ └── versions.yml
│ ├── pbmc8k.count
│ │ ├── 10x_version2_whitelist.txt
│ │ ├── counts_unfiltered
│ │ │ ├── cells_x_genes.barcodes.txt
│ │ │ ├── cells_x_genes.genes.txt
│ │ │ └── cells_x_genes.mtx
│ │ ├── emptydrops_filtered
│ │ │ ├── cells_x_genes.barcodes.txt
│ │ │ ├── cells_x_genes.genes.txt
│ │ │ └── cells_x_genes.mtx
│ │ ├── inspect.json
│ │ ├── kb_info.json
│ │ ├── matrix.ec
│ │ ├── output.bus
│ │ ├── output.unfiltered.bus
│ │ ├── run_info.json
│ │ └── transcripts.txt
│ └── versions.yml
├── multiqc
└── pipeline_info
For cellranger, basically it happened the same to Kallisto. The difference is that cellranger always produces a pair of raw/filtered and then I had to just modify the channels to account for that so would make filtering later on easier (see here)
I also had to update the conversion-scripts because the emptydrops filter module does not produce a .h5
file so the scripts where not ready for converting .mtx
cellranger files.
Results for it looks like this:
cellranger_run/
├── cellranger
│ ├── count
│ │ ├── pbmc8k
│ │ │ ├── emptydrops_filtered
│ │ │ └── outs
│ │ └── versions.yml
│ ├── mkgtf
│ │ └── genome_genes.filtered.gtf
│ ├── mkref
│ │ ├── cellranger_reference
│ │ │ ├── fasta
│ │ │ ├── genes
│ │ │ ├── reference.json
│ │ │ └── star
│ │ └── versions.yml
│ └── mtx_conversions
│ ├── combined_custom_emptydrops_filter_matrix.h5ad
│ ├── combined_filtered_matrix.h5ad
│ ├── combined_raw_matrix.h5ad
│ ├── pbmc8k
│ │ ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│ │ ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│ │ ├── pbmc8k_filtered_matrix.h5ad
│ │ ├── pbmc8k_filtered_matrix.rds
│ │ ├── pbmc8k_raw_matrix.h5ad
│ │ └── pbmc8k_raw_matrix.rds
│ └── versions.yml
├── fastqc
├── multiqc
└── pipeline_info
For star, basically the same for cellranger. It always produce a raw/filtered pair, but I had to adjust the out-channels to make the filtering/selection easier. Of course, adjusting all the downstream channel selections to account for them.
The results for it look like this:
star_run/
├── fastqc
├── multiqc
├── pipeline_info
└── star
├── mtx_conversions
│ ├── combined_custom_emptydrops_filter_matrix.h5ad
│ ├── combined_filtered_matrix.h5ad
│ ├── combined_raw_matrix.h5ad
│ ├── pbmc8k
│ │ ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│ │ ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│ │ ├── pbmc8k_filtered_matrix.h5ad
│ │ ├── pbmc8k_filtered_matrix.rds
│ │ ├── pbmc8k_raw_matrix.h5ad
│ │ └── pbmc8k_raw_matrix.rds
│ └── versions.yml
└── pbmc8k
├── emptydrops_filtered
│ ├── barcodes.tsv
│ ├── features.tsv
│ └── matrix.mtx
├── pbmc8k.Aligned.sortedByCoord.out.bam
├── pbmc8k.Log.final.out
├── pbmc8k.Log.out
├── pbmc8k.Log.progress.out
├── pbmc8k.SJ.out.tab
├── pbmc8k.Solo.out
│ ├── Barcodes.stats
│ └── Gene
└── versions.yml
I could not even run them, so could not be tested nor integrated. Thus, I believe they first need to be solved in order to have a proper testing profile to allow integrating these modules here.
Just not sure what should go first.
You will see in the scrnaseq.nf
workflow, that even after generating a proper channel for raw/filtered in the modules of cellranger, star and kallisto I still mix them all in the general ch_mtx_matrices
channel.
I do this to guarantee all results are going to the downstream analysis. The reason I modified the modules to generate these split channels, was just to make sure they grabbed the folder which contains only the raw/filtered results to allow a proper filtering. Before they were using a higher level ** selection and then all files were being dumped in the channel in a flatten mode, so the filtering was very complicate and resulting in files with duplicated names.
You see that I had to modify the conversion subworkflow to account for that as well.
Looks good to me apart from some parts:
- Changes in the module code --> need to go upstream, ideally open PRs already for this
- Upgrades in the respective upstream modules might be necessary, check out the cellranger update PR, which could be merged prior to updating the modules here I believe --> should be easy to do..
- Some more docs on what this sfeature does would be helpful - so that people can both see it in the changelog and also in the main documentation
Yes, the changes in the modules I added as a TODO
so that, once we know and agree on all changes required, it can be done in nf-core/modules
.
About the docs, I will work on it.
Hi @fmalmeida,
this looks really good and also many thanks for the detailled explanation above!
My only main concern would be the transposition thing, maybe there's a better solution that doesn't require too much effort?
Mid-term, I think we should restructure the mtx conversion part of the pipeline as it becomes a bit messy. This is beyond the scope of this issue though, so I created https://github.com/nf-core/scrnaseq/issues/310.
Regarding cellranger-arc: I think it should run now. But not entirely sure if it makes sense to run emptydrops on scATAC-seq data in the first place? @heylf, WDYT?
EDIT: seems https://github.com/nf-core/scrnaseq/pull/300 needs to be merged first.
@nf-core-bot fix linting
Regarding cellranger-arc: I think it should run now. But not entirely sure if it makes sense to run emptydrops on scATAC-seq data in the first place? @heylf, WDYT?
Lets say I have never compared filtering techniques for scATAC data. In the past I saw you can use emptydrop for scATAC, but emptydrops is specifcally for scRNA, so I would be careful in doing that.
Hi @grst , So, the overall summary after reading all the comments, so that can finalise this PR is:
cellrangerarc
as @heylf said, it will probably not make sense. If in the future it is requested, we can work specifically on it.Is that all?
sounds good
@nf-core-bot fix linting
Now that kallisto was updated, and the workflows it provides are different. I will have to test it with them as well. Once they work, the last thing required will be the docs and the PRs to modules.
Hi @grst ,
Can you take a look again at the changes?
docs/usage.md
cellranger/count
module: https://github.com/nf-core/modules/pull/5108kallistobustools/count
module: https://github.com/nf-core/modules/pull/5110/fileskb_workflows
.The only one missing is the last one, which is currently running.
I'm wondering if instead of updating all those modules it would be easier to do something like
ch_filtered = ch_out.map{
meta, files -> [meta, out.findAll{ it -> it.contains("filtered") }]
}
I'm wondering if instead of updating all those modules it would be easier to do something like
ch_filtered = ch_out.map{ meta, files -> [meta, out.findAll{ it -> it.contains("filtered") }] }
That was my first try. But because many of them use the ** to grab the files. It catches the files and not the directories. So, when filtering, we only select the files that have it in its name instead of the directory and all that is inside.
I added some information in the last section of this comment https://github.com/nf-core/scrnaseq/pull/301#issuecomment-1945592102
I see, fair enough then
Pipeline execution terminated.
As said here, the nf-core modules were updated and TODOs removed. Documentation was updated. And also, workflow (kallisto) was updated so that the new modules works for both non-standard kallisto workflows, lamanno and nac.
Results structure organization and namings are being produced as said here.
Finally, all testings passed, so, merging the PR 😄
Right now, just opening a draft PR on the attempt of solving issue #81 so that it is easier to keep track of modifications.
One work is "done" I will add a thorough overview of the changes, with explanations of the main modifications and listing on TODOs to be addressed before merging.
Then, of course, only then I will add some reviewers.