Check expression specificity annotation names (ES header)

DaianeH commented 4 years ago

Hi there,

I run CELLEX and now am running CELLECT on Hippocampus dataset of http://dropviz.org. Genes are originally mouse gene names, I mapped them to human Ensembl IDs using Ensembl Biomart.

My config.yml looks like:

BASE_OUTPUT_DIR: ./CELLECT-LDSC-EXAMPLE

SPECIFICITY_INPUT:

id: dropviz-hippocampus path: /hpc/users/hemerd01/daiane/projects/scRNAseq/data/brain_mouse_dropviz/hippocampus/brain_dropviz_hippocampus.esmu_human_cellect.csv

GWAS_SUMSTATS:

id: BMI_Yengo2018 path: example/BMI_Yengo2018.sumstats.gz

ANALYSIS_TYPE: prioritization: True conditional: False heritability: False heritability_intervals: False

WINDOW_DEFINITION: WINDOW_SIZE_KB: 100 WINDOW_LD_BASED: False

LDSC_CONST: DATA_DIR: data/ldsc LDSC_DIR: ldsc NUMPY_CORES: 1

When I run:

snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile example/config-ldsc_hippocampus_dropviz.yml

Several "rule format_and_map_genes:" work, until at some point, I start getting: "Error in rule format_and_map_genes:" The blocks where this happens have a description of jobid, input, output, log and conda-env, but the log where the error was supposed to be reported is empty.

I successfully run CELLECT on datasets prepared on CELLEX before, but can't trace what's going on with this one. Would it be helpful if I send you the entire .log file of the CELLECT run and perhaps the esmu dataset as well?

Many thanks in advance,

Tobi1kenobi commented 4 years ago

Hi Daiane,

Yes if you could post the actual error CELLECT returns in its log that would be useful.

Best, Tobi

DaianeH commented 4 years ago

There are 96 errors like this one:

Error in rule format_and_map_genes: jobid: 182 output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.22.bed log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.22.txt (check log file(s) for error message) conda-env: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/conda/216cc198

File log.format_and_map_snake.dropviz-hippocampus.22.txt where the error messages should be is empty.

Tobi1kenobi commented 4 years ago

Hi Daian,

Thanks - that's helpful. Would you be able to send over all of your cell-type/cell-cluster names too? I suspect they may be the issue.

Just to confirm - several of the rules complete successfully it is just 96 that do not?

Best, Tobi

DaianeH commented 4 years ago

Hi Tobi,

Correct, several rules complete, but 96 don't (with an error but no descriptive error message). I think I managed to successfully attach the full .log file, let me know if not.

2020-03-05T110455.599906.snakemake.log

Cell cluster names are:

11-1 6-4 6-1 5-11 5-3 6-3 6-6 5-7 5-6 5-14 1-23 1-14 5-1 6 5-4 1-24 6-5 1-19 6-8 1 15 5-8 6-7 4 1-21 4-1 5-9 5-2 8-1 1-7 1-20 1-6 5 1-15 3-7 5-12 3-3 2-6 15-1 11 1-12 3-6 3-5 1-26 7 15-2 13-2 1-8 3-8 1-22 3-10 2-1 2-5 1-2 1-18 7-1 2-7 8-5 13-3 8-4 3-2 1-9 4-2 1-25 9 3-1 1-13 8-3 2-2 1-11 3-11 5-5 9-1 1-27 6-2 2-4 1-17 1-16 16 8 5-15 7-2 1-5 16-1 2 8-2 1-1 5-10 7-3 17-2 17 16-4 17-3 5-13 2-3 1-10 10 13-1 1-3 17-1 9-4 3-9 9-2 13-5 3 1-4 14 13-4 18 9-3 17-4 16-2 12 NA 10-1 13-6 10-2 16-3 13 21 20 22 3-4 19

Thanks, Daiane.

Tobi1kenobi commented 4 years ago

Hi Daiane,

I don't think any of the rules are completing - your log file indicates you're using 96 cores so only 96 jobs can run in parallel at any one time, and all of them crash.

I believe it may be a result of the naming scheme of the cell-types. I have a solution which you should try: Rename your cell-types using letters e.g. '19' -> 'CT19', '3-4' -> 'CT3-4', etc. I think it may be an issue with CELLECT's use of regex and adding non-numerical characters could resolve it.

The above should be fairly straightforward to do the above in a text editor (without typing 'CT' 96 times) or with a bit of bash code but if you need a hand or if it doesn't work, let me know.

Best, Tobi

DaianeH commented 4 years ago

Hi Tobi, I tried adding an H in front of every cell-type since all cells come from the Hippocampus. It did not solve the issue. Log attached. Thanks, Daiane.

2020-03-05T161448.352052.snakemake.log

Tobi1kenobi commented 4 years ago

Hi Daiane,

Hmm, I am stumped and really can't infer anything else from what you've sent.

If you can upload the CELLEX-processed dataset I'll try giving it a whirl myself - hopefully that will make it easier to diagnose the problem.

Best, Tobi

DaianeH commented 4 years ago

Hi Tobi,

CELLEX-processed dataset attached. Thanks!

brain_dropviz_hippocampus.esmu_human_cellect_test.csv.zip

pascaltimshel commented 4 years ago

@Tobi1kenobi : I agree that this could be an issue with snakemake regex. @DaianeH : It could be a problem to have cell-types named e.g. H1, H1-1 and H1-11. For a quick fix of this, I suggest removing the hyphens. Can you check if that works? We will, of course, look into this for a better long term solution.

H1
H1-1
H1-10
H1-11
H1-12
H1-13
H1-14
H1-15
H1-16
H1-17
H1-18
H1-19
H1-2
H1-20
H1-21
H1-22
H1-23
H1-24
H1-25
H1-26
H1-27
H1-3
H1-4
H1-5
H1-6
H1-7
H1-8
H1-9
H10
H10-1
H10-2
H11
H11-1
H12
H13
H13-1
H13-2
H13-3
H13-4
H13-5
H13-6
H14
H15
H15-1
H15-2
H16
H16-1
H16-2
H16-3
H16-4
H17
H17-1
H17-2
H17-3
H17-4
H18
H19
H2
H2-1
H2-2
H2-3
H2-4
H2-5
H2-6
H2-7
H20
H21
H22
H3
H3-1
H3-10
H3-11
H3-2
H3-3
H3-4
H3-5
H3-6
H3-7
H3-8
H3-9
H4
H4-1
H4-2
H5
H5-1
H5-10
H5-11
H5-12
H5-13
H5-14
H5-15
H5-2
H5-3
H5-4
H5-5
H5-6
H5-7
H5-8
H5-9
H6
H6-1
H6-2
H6-3
H6-4
H6-5
H6-6
H6-7
H6-8
H7
H7-1
H7-2
H7-3
H8
H8-1
H8-2
H8-3
H8-4
H8-5
H9
H9-1
H9-2
H9-3
H9-4
OTHER

pascaltimshel commented 4 years ago

(@Tobi1kenobi : the problem reminds me of https://github.com/perslab/CELLECT/issues/25 )

Tobi1kenobi commented 4 years ago

Hi Daiane,

So I no longer believe it's a problem with your cell-type names. I tried running your CELLEX specificity matrix and I get an error message:

ValueError: file_multi_gene_set contains duplicated genes within an annotation (i.e. non-unique combinations of 'annotation' and 'gene_input' columns). Fix and rerun.

Not sure why this doesn't come up when you run CELLECT or get sent to the log file, we'll have to look into this.

But a quick check for duplicate genes: cat brain_dropviz_hippocampus.esmu_human_cellect_test.csv | cut -d, -f 1 | sort | uniq -c | sort

Reveals that there are many in this dataset and this is almost certainly your problem.

I think it's because you did the mapping step yourself. CELLEX has its own built-in mapping functionality that you can use or if you do want to use your own mapping functionality make sure to drop duplicates.

Hopefully everything should work now.

Best, Tobi

DaianeH commented 4 years ago

Thanks Tobi, it makes sense. So if I want to use CELLEX mapping functionality, first I need to map mouse gene names to mouse Ensembl ID, then mouse Ensembl ID to human Ensembl ID. The latter is done by "cellex.utils.mapping.ens_mouse_to_ens_human", but how can I map mouse gene names to mouse Ensembl ID?

Tobi1kenobi commented 4 years ago

A similar function exists for mapping MGI gene names to Ensembl mouse it should be cellex.utils.mapping.mgi_mouse_to_ens_mouse.

@tstannius can you confirm this is correct?

Best, Tobi

bengnielsen commented 4 years ago

Yep cellex.utils.mapping.mgi_mouse_to_ens_mouse loads a file maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz within the CELLEX repo , and defines a dictionary of {mgi gene name: ensembl_gene_id}.

The doc string however erroneously states the function as "Maps mouse ensembl gene id's to human ensembl gene id's" though.

DaianeH commented 4 years ago

I'm not sure if the .txt.gz file is being loaded, though:

eso_mapped = cellex.utils.mapping.mgi_mouse_to_ens_mouse(eso.results["esmu"]) Traceback (most recent call last): File "", line 1, in File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream self, resource_name File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream return io.BytesIO(self.get_resource_string(manager, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string return self._get(self._fn(self.module_path, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get return self.loader.get_data(path) OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

DaianeH commented 4 years ago

Hi Tobi,

Would it perhaps be possible for you to try to run the cellex.utils.mapping.mgi_mouse_to_ens_mouse function on the attached file? Just to check if you get the same error I reported above.

Many thanks, Daiane.

brain_dropviz_hippocampus.esmu.csv.gz

bengnielsen commented 4 years ago

Thanks for pointing out this error! I also experience the same issue with the txt.gz files not being loaded, so definitely a CELLEX bug here.

Just to check, within your /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/ -directory, is there a map-directory?

DaianeH commented 4 years ago

So I can't really see anything after:

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/

because it says it's not a directory.

Not experienced in python but from what I read an .egg is the same as a zipped file, so when I do:

unzip /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg

I get all the directories and files under CELLEX... but no, there's no 'map' directory. The closest thing to it is:

cellex/utils/mapping/init.py cellex/utils/mapping/ens_human_to_symbol.py cellex/utils/mapping/ens_mouse_to_ens_human.py cellex/utils/mapping/mgi_mouse_to_ens_mouse.py cellex/utils/mapping/pycache/init.cpython-37.pyc cellex/utils/mapping/pycache/ens_human_to_symbol.cpython-37.pyc cellex/utils/mapping/pycache/ens_mouse_to_ens_human.cpython-37.pyc cellex/utils/mapping/pycache/mgi_mouse_to_ens_mouse.cpython-37.pyc

bengnielsen commented 4 years ago

CELLEX has now been updated by @tstannius with some hotfixes for the mapping functions.

I managed to convert all the MGI gene names to Ensembl gene names from the brain_dropviz_hippocampus.esmu.csv.gz by:

import cellex as cellex
import pandas as pd

df = pd.read_csv('brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene')   #important step before mapping! 

cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)

And then when you call the df it should now contain Mouse ensembl id gene names. Let us know if it works!

DaianeH commented 4 years ago

Hi, thanks for the update!

I did exactly what you typed above and got the error:

import numpy as np import cellex as cellex import pandas as pd df = pd.read_csv('/brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip') df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True) df = df.set_index('gene') cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True) Mapping: mouse mgi gene id's --> mouse ensembl gene id's ... Traceback (most recent call last): File "", line 1, in File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream self, resource_name File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream return io.BytesIO(self.get_resource_string(manager, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string return self._get(self._fn(self.module_path, resource_name)) File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get return self.loader.get_data(path) OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

Apologies for the super basic question, but does CELLEX need to be installed again or something?

bengnielsen commented 4 years ago

No worries - no questions are too basic! But yeah, it seems like your server still has CELLEX v.1.0.1, which means that the bug from yesterday

OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

would still occur.

Did you happen to install CELLEX using pip? If yes, you can update it with a pip install -U cellex and get the latest CELLEX v. 1.1.1

DaianeH commented 4 years ago

Ok, so now CELLEX is updated and I can generate the hippocampus_cellect.esmu.csv.gz with ENSEMBL id's. However, when I run CELLECT on it, now I get another error... well in fact I don't get an error, I just get:

... [Sun Mar 15 22:33:49 2020] Finished job 215. 140 of 240 steps (58%) done Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/log/2020-03-15T223106.425023.snakemake.log

And I cannot exactly know which job execution failed - there's no errors in the log file, only a few warnings. Could you please be so kind to have a look at the log file and help me on this? Thanks!

2020-03-15T223106.425023.snakemake.log

bengnielsen commented 4 years ago

Hi again, I've looked into the 2020-03-15T223106.425023.snakemake.log file and spotted the following error near the bottom:

Missing files after 5 seconds:
/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed

So the point where things start to go wrong is at the format_and_map_genes step, which outputs a .bed for each cell type present in your CELLEX'ed dataset:

[Sun Mar 15 22:31:34 2020]
rule format_and_map_genes:
    input: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/multi_genesets/multi_geneset.dropviz-hippocampus.txt
    output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed
    log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.nan.txt
    jobid: 238
    wildcards: BASE_OUTPUT_DIR=/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE, run_prefix=dropviz-hippocampus, annotation=nan

It would perhaps seem like one of your celltypes in the hippocampus_cellect.esmu.csv.gz have a nan as their celltype label and thus might messes up the script (perhaps because nan doesn't translate very well when converted to string? What's your take on this, @Tobi1kenobi?) and thus required files .bed files to proceed are not generated, hence snakemake cancels further job execution.

Sorry if this is inconvenient, but I might have to ask you to find the cell types in the CELLEX'ed dataset with the nan labels and rename them.

Tobi1kenobi commented 4 years ago

I'd agree, this sounds like #3 . It should've been fixed a long time ago but I think has been forgotten about, apologies.

But as @bengnielsen said, I'd expect renaming 'nan' as 'unknown' (for example) would resolve the issue.

DaianeH commented 4 years ago

Hi there,

Changing the name of the cluster from 'nan' to 'Unknown' worked.

So just to close this thread:

1 - CELLECT throws an error when the name of a cluster is "nan" 2 - You suggested I put a letter in front of the cluster numbers, but that doesn't make a difference when there's an hyphen in the cluster name (when there's numbers and hyphens, the cluster name is read as a string). However, I run CELLECT in another dataset with cluster names represented only as numbers, and it threw an error. When adding a letter in front of the cluster numbers, the error was gone. So CELLECT will throw an error when cluster names are numbers only (it must be a string).

Many thanks for your help!

pascaltimshel commented 4 years ago

Thanks @DaianeH !

To fix this the code should either : 1) fail explicitly (before staring Snakemake workflow) when number-only or nan encountered

or

2) improve snakemake regex(?) to allow for number-only or nan annotation names

perslab / CELLECT

Check expression specificity annotation names (ES header) #45