Open DaianeH opened 4 years ago
Hi Daiane,
Yes if you could post the actual error CELLECT returns in its log that would be useful.
Best, Tobi
There are 96 errors like this one:
Error in rule format_and_map_genes: jobid: 182 output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.22.bed log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.22.txt (check log file(s) for error message) conda-env: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/conda/216cc198
File log.format_and_map_snake.dropviz-hippocampus.22.txt where the error messages should be is empty.
Hi Daian,
Thanks - that's helpful. Would you be able to send over all of your cell-type/cell-cluster names too? I suspect they may be the issue.
Just to confirm - several of the rules complete successfully it is just 96 that do not?
Best, Tobi
Hi Tobi,
Correct, several rules complete, but 96 don't (with an error but no descriptive error message). I think I managed to successfully attach the full .log file, let me know if not.
2020-03-05T110455.599906.snakemake.log
Cell cluster names are:
11-1 6-4 6-1 5-11 5-3 6-3 6-6 5-7 5-6 5-14 1-23 1-14 5-1 6 5-4 1-24 6-5 1-19 6-8 1 15 5-8 6-7 4 1-21 4-1 5-9 5-2 8-1 1-7 1-20 1-6 5 1-15 3-7 5-12 3-3 2-6 15-1 11 1-12 3-6 3-5 1-26 7 15-2 13-2 1-8 3-8 1-22 3-10 2-1 2-5 1-2 1-18 7-1 2-7 8-5 13-3 8-4 3-2 1-9 4-2 1-25 9 3-1 1-13 8-3 2-2 1-11 3-11 5-5 9-1 1-27 6-2 2-4 1-17 1-16 16 8 5-15 7-2 1-5 16-1 2 8-2 1-1 5-10 7-3 17-2 17 16-4 17-3 5-13 2-3 1-10 10 13-1 1-3 17-1 9-4 3-9 9-2 13-5 3 1-4 14 13-4 18 9-3 17-4 16-2 12 NA 10-1 13-6 10-2 16-3 13 21 20 22 3-4 19
Thanks, Daiane.
Hi Daiane,
I don't think any of the rules are completing - your log file indicates you're using 96 cores so only 96 jobs can run in parallel at any one time, and all of them crash.
I believe it may be a result of the naming scheme of the cell-types. I have a solution which you should try: Rename your cell-types using letters e.g. '19' -> 'CT19', '3-4' -> 'CT3-4', etc. I think it may be an issue with CELLECT's use of regex and adding non-numerical characters could resolve it.
The above should be fairly straightforward to do the above in a text editor (without typing 'CT' 96 times) or with a bit of bash code but if you need a hand or if it doesn't work, let me know.
Best, Tobi
Hi Tobi, I tried adding an H in front of every cell-type since all cells come from the Hippocampus. It did not solve the issue. Log attached. Thanks, Daiane.
Hi Daiane,
Hmm, I am stumped and really can't infer anything else from what you've sent.
If you can upload the CELLEX-processed dataset I'll try giving it a whirl myself - hopefully that will make it easier to diagnose the problem.
Best, Tobi
Hi Tobi,
CELLEX-processed dataset attached. Thanks!
@Tobi1kenobi : I agree that this could be an issue with snakemake regex. @DaianeH : It could be a problem to have cell-types named e.g. H1, H1-1 and H1-11. For a quick fix of this, I suggest removing the hyphens. Can you check if that works? We will, of course, look into this for a better long term solution.
H1
H1-1
H1-10
H1-11
H1-12
H1-13
H1-14
H1-15
H1-16
H1-17
H1-18
H1-19
H1-2
H1-20
H1-21
H1-22
H1-23
H1-24
H1-25
H1-26
H1-27
H1-3
H1-4
H1-5
H1-6
H1-7
H1-8
H1-9
H10
H10-1
H10-2
H11
H11-1
H12
H13
H13-1
H13-2
H13-3
H13-4
H13-5
H13-6
H14
H15
H15-1
H15-2
H16
H16-1
H16-2
H16-3
H16-4
H17
H17-1
H17-2
H17-3
H17-4
H18
H19
H2
H2-1
H2-2
H2-3
H2-4
H2-5
H2-6
H2-7
H20
H21
H22
H3
H3-1
H3-10
H3-11
H3-2
H3-3
H3-4
H3-5
H3-6
H3-7
H3-8
H3-9
H4
H4-1
H4-2
H5
H5-1
H5-10
H5-11
H5-12
H5-13
H5-14
H5-15
H5-2
H5-3
H5-4
H5-5
H5-6
H5-7
H5-8
H5-9
H6
H6-1
H6-2
H6-3
H6-4
H6-5
H6-6
H6-7
H6-8
H7
H7-1
H7-2
H7-3
H8
H8-1
H8-2
H8-3
H8-4
H8-5
H9
H9-1
H9-2
H9-3
H9-4
OTHER
(@Tobi1kenobi : the problem reminds me of https://github.com/perslab/CELLECT/issues/25 )
Hi Daiane,
So I no longer believe it's a problem with your cell-type names. I tried running your CELLEX specificity matrix and I get an error message:
ValueError: file_multi_gene_set contains duplicated genes within an annotation (i.e. non-unique combinations of 'annotation' and 'gene_input' columns). Fix and rerun.
Not sure why this doesn't come up when you run CELLECT or get sent to the log file, we'll have to look into this.
But a quick check for duplicate genes:
cat brain_dropviz_hippocampus.esmu_human_cellect_test.csv | cut -d, -f 1 | sort | uniq -c | sort
Reveals that there are many in this dataset and this is almost certainly your problem.
I think it's because you did the mapping step yourself. CELLEX has its own built-in mapping functionality that you can use or if you do want to use your own mapping functionality make sure to drop duplicates.
Hopefully everything should work now.
Best, Tobi
Thanks Tobi, it makes sense. So if I want to use CELLEX mapping functionality, first I need to map mouse gene names to mouse Ensembl ID, then mouse Ensembl ID to human Ensembl ID. The latter is done by "cellex.utils.mapping.ens_mouse_to_ens_human", but how can I map mouse gene names to mouse Ensembl ID?
A similar function exists for mapping MGI gene names to Ensembl mouse it should be cellex.utils.mapping.mgi_mouse_to_ens_mouse
.
@tstannius can you confirm this is correct?
Best, Tobi
Yep cellex.utils.mapping.mgi_mouse_to_ens_mouse
loads a file maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz
within the CELLEX repo , and defines a dictionary of {mgi gene name: ensembl_gene_id}
.
The doc string however erroneously states the function as "Maps mouse ensembl gene id's to human ensembl gene id's" though.
I'm not sure if the .txt.gz file is being loaded, though:
eso_mapped = cellex.utils.mapping.mgi_mouse_to_ens_mouse(eso.results["esmu"]) Traceback (most recent call last): File "
", line 1, in File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream self, resource_name File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream return io.BytesIO(self.get_resource_string(manager, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string return self._get(self._fn(self.module_path, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get return self.loader.get_data(path) OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'
Hi Tobi,
Would it perhaps be possible for you to try to run the cellex.utils.mapping.mgi_mouse_to_ens_mouse function on the attached file? Just to check if you get the same error I reported above.
Many thanks, Daiane.
Thanks for pointing out this error! I also experience the same issue with the txt.gz files not being loaded, so definitely a CELLEX bug here.
Just to check, within your /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/
-directory, is there a map
-directory?
So I can't really see anything after:
/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/
because it says it's not a directory.
Not experienced in python but from what I read an .egg is the same as a zipped file, so when I do:
unzip /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg
I get all the directories and files under CELLEX... but no, there's no 'map' directory. The closest thing to it is:
cellex/utils/mapping/init.py cellex/utils/mapping/ens_human_to_symbol.py cellex/utils/mapping/ens_mouse_to_ens_human.py cellex/utils/mapping/mgi_mouse_to_ens_mouse.py cellex/utils/mapping/pycache/init.cpython-37.pyc cellex/utils/mapping/pycache/ens_human_to_symbol.cpython-37.pyc cellex/utils/mapping/pycache/ens_mouse_to_ens_human.cpython-37.pyc cellex/utils/mapping/pycache/mgi_mouse_to_ens_mouse.cpython-37.pyc
CELLEX has now been updated by @tstannius with some hotfixes for the mapping functions.
I managed to convert all the MGI gene names to Ensembl gene names from the brain_dropviz_hippocampus.esmu.csv.gz
by:
import cellex as cellex
import pandas as pd
df = pd.read_csv('brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene') #important step before mapping!
cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)
And then when you call the df
it should now contain Mouse ensembl id gene names. Let us know if it works!
Hi, thanks for the update!
I did exactly what you typed above and got the error:
import numpy as np import cellex as cellex import pandas as pd df = pd.read_csv('/brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip') df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True) df = df.set_index('gene') cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True) Mapping: mouse mgi gene id's --> mouse ensembl gene id's ... Traceback (most recent call last): File "
", line 1, in File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream self, resource_name File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream return io.BytesIO(self.get_resource_string(manager, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string return self._get(self._fn(self.module_path, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get return self.loader.get_data(path) OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'
Apologies for the super basic question, but does CELLEX need to be installed again or something?
No worries - no questions are too basic! But yeah, it seems like your server still has CELLEX v.1.0.1, which means that the bug from yesterday
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'
would still occur.
Did you happen to install CELLEX using pip? If yes, you can update it with a pip install -U cellex
and get the latest CELLEX v. 1.1.1
Ok, so now CELLEX is updated and I can generate the hippocampus_cellect.esmu.csv.gz with ENSEMBL id's. However, when I run CELLECT on it, now I get another error... well in fact I don't get an error, I just get:
... [Sun Mar 15 22:33:49 2020] Finished job 215. 140 of 240 steps (58%) done Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/log/2020-03-15T223106.425023.snakemake.log
And I cannot exactly know which job execution failed - there's no errors in the log file, only a few warnings. Could you please be so kind to have a look at the log file and help me on this? Thanks!
Hi again, I've looked into the 2020-03-15T223106.425023.snakemake.log
file and spotted the following error near the bottom:
Missing files after 5 seconds:
/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed
So the point where things start to go wrong is at the format_and_map_genes
step, which outputs a .bed
for each cell type present in your CELLEX'ed dataset:
[Sun Mar 15 22:31:34 2020]
rule format_and_map_genes:
input: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/multi_genesets/multi_geneset.dropviz-hippocampus.txt
output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed
log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.nan.txt
jobid: 238
wildcards: BASE_OUTPUT_DIR=/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE, run_prefix=dropviz-hippocampus, annotation=nan
It would perhaps seem like one of your celltypes in the hippocampus_cellect.esmu.csv.gz
have a nan
as their celltype label and thus might messes up the script (perhaps because nan
doesn't translate very well when converted to string? What's your take on this, @Tobi1kenobi?) and thus required files .bed
files to proceed are not generated, hence snakemake cancels further job execution.
Sorry if this is inconvenient, but I might have to ask you to find the cell types in the CELLEX'ed dataset with the nan
labels and rename them.
I'd agree, this sounds like #3 . It should've been fixed a long time ago but I think has been forgotten about, apologies.
But as @bengnielsen said, I'd expect renaming 'nan' as 'unknown' (for example) would resolve the issue.
Hi there,
Changing the name of the cluster from 'nan' to 'Unknown' worked.
So just to close this thread:
1 - CELLECT throws an error when the name of a cluster is "nan" 2 - You suggested I put a letter in front of the cluster numbers, but that doesn't make a difference when there's an hyphen in the cluster name (when there's numbers and hyphens, the cluster name is read as a string). However, I run CELLECT in another dataset with cluster names represented only as numbers, and it threw an error. When adding a letter in front of the cluster numbers, the error was gone. So CELLECT will throw an error when cluster names are numbers only (it must be a string).
Many thanks for your help!
Thanks @DaianeH !
To fix this the code should either : 1) fail explicitly (before staring Snakemake workflow) when number-only or nan encountered
or
2) improve snakemake regex(?) to allow for number-only or nan annotation names
Hi there,
I run CELLEX and now am running CELLECT on Hippocampus dataset of http://dropviz.org. Genes are originally mouse gene names, I mapped them to human Ensembl IDs using Ensembl Biomart.
My config.yml looks like:
BASE_OUTPUT_DIR: ./CELLECT-LDSC-EXAMPLE
SPECIFICITY_INPUT:
GWAS_SUMSTATS:
ANALYSIS_TYPE: prioritization: True conditional: False heritability: False heritability_intervals: False
WINDOW_DEFINITION: WINDOW_SIZE_KB: 100 WINDOW_LD_BASED: False
LDSC_CONST: DATA_DIR: data/ldsc LDSC_DIR: ldsc NUMPY_CORES: 1
When I run:
snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile example/config-ldsc_hippocampus_dropviz.yml
Several "rule format_and_map_genes:" work, until at some point, I start getting: "Error in rule format_and_map_genes:" The blocks where this happens have a description of jobid, input, output, log and conda-env, but the log where the error was supposed to be reported is empty.
I successfully run CELLECT on datasets prepared on CELLEX before, but can't trace what's going on with this one. Would it be helpful if I send you the entire .log file of the CELLECT run and perhaps the esmu dataset as well?
Many thanks in advance,