"skipped due to small size" error

martinjzhang / scDRS

Single-cell disease relevance score (scDRS)

https://martinjzhang.github.io/scDRS/

MIT License


"skipped due to small size" error #54

Closed NaomiHuntley closed 1 year ago

NaomiHuntley commented 1 year ago

Hello. I am computing scDRS scores for several different traits. However, for every trait I try, I get a message that the trait is being skipped due to small size: [Screenshot 2023-03-20 at 10 42 30 AM]

Here is the head of the file I am testing the code on: [Screenshot 2023-03-20 at 10 43 14 AM]

Thank you!

martinjzhang commented 1 year ago

Hi,

The format of the .gs file is expected to be gene1:weight1,gene2:weight2,..., where gene1, gene2, etc. are gene names. However, in your file, gene1 appears to be represented by the number 130576, which is not a gene name (like Cd4). As a result, none of the genes in your .gs file appear to be present in your .h5ad file, which means that scDRS will not be able to process the data correctly. To ensure that scDRS can work with your data, please check that the genes listed in your .gs file match the gene names present in your .h5ad file.
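A quick way to confirm this diagnosis is to parse the gene set string and count how many of its genes appear among the gene names in the data. The following is only a minimal sketch: `parse_geneset` and `genes_in_data` are hypothetical helpers (not part of scDRS), the weights are made up, and the list of gene names stands in for whatever the .h5ad file actually contains.

```python
def parse_geneset(geneset):
    """Parse 'gene1:weight1,gene2:weight2,...' into a {gene: weight} dict."""
    weights = {}
    for item in geneset.strip().split(","):
        gene, sep, weight = item.rpartition(":")
        if not sep or not gene:
            raise ValueError(f"expected 'gene:weight', got {item!r}")
        weights[gene] = float(weight)
    return weights


def genes_in_data(weights, data_gene_names):
    """Return the gene set entries that also occur in the data, in gene set order."""
    available = set(data_gene_names)
    return [gene for gene in weights if gene in available]


# Illustrative inputs only: numeric IDs in the gene set, symbols in the data.
gs_with_ids = parse_geneset("130576:5.2,920:3.1")
data_genes = ["Cd4", "Actb"]

overlap = genes_in_data(gs_with_ids, data_genes)
print(f"{len(overlap)} of {len(gs_with_ids)} gene set entries found in the data")
```

When the overlap is zero or very small, as with numeric IDs checked against gene symbols, the gene set cannot be matched to the data, which is consistent with the "skipped due to small size" message.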

NaomiHuntley commented 1 year ago

Thank you for the quick reply. It seems that the numbers come from the magma gene analysis step. This is my first time using magma, so do you happen to have any insight as to why this would happen?

Thank you!

martinjzhang commented 1 year ago

@KangchengHou could you help with the MAGMA question? Thanks

KangchengHou commented 1 year ago

@NaomiHuntley in the MAGMA directory, there is a file <MAGMA_DIR>/NCBI37.3.gene.loc which contains the correspondence between gene number and gene symbol. I will add this information to https://github.com/martinjzhang/scDRS/blob/master/docs/compute_magma_gs.md. Please let me know if you have any questions.

NaomiHuntley commented 1 year ago

Hi @KangchengHou - as I am new to this, I am not entirely sure how to map the gene numbers to the symbols. I looked through the documentation for magma, but it seems that is not something I can do in magma. Is there a different tool? Thanks in advance for the clarification!

martinjzhang commented 1 year ago

Hi @NaomiHuntley, NCBI37.3.gene.loc is a .tsv file whose first column contains the gene numbers and whose last column contains the gene names. You need to write a small script (e.g., in R or Python) to do the mapping, replacing each number with the corresponding gene name.
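The mapping described here could look roughly like the Python sketch below. It assumes only what is stated in this thread: the first column of the gene-loc file is the gene number and the last column is the gene name. The helper names (`load_id_to_symbol`, `remap_geneset`) and the sample rows (`GENE_A`, `GENE_B`, and their IDs) are hypothetical placeholders, not real entries from NCBI37.3.gene.loc.

```python
import io


def load_id_to_symbol(handle):
    """Build {gene_number: gene_name} from a whitespace-delimited gene-loc file."""
    mapping = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            # First column: gene number; last column: gene name.
            mapping[fields[0]] = fields[-1]
    return mapping


def remap_geneset(geneset, id_to_symbol):
    """Rewrite 'id:weight,...' as 'name:weight,...'; also return the unmapped IDs."""
    remapped, unmapped = [], []
    for item in geneset.strip().split(","):
        gene_id, sep, weight = item.rpartition(":")
        if not sep:
            raise ValueError(f"expected 'gene:weight', got {item!r}")
        symbol = id_to_symbol.get(gene_id)
        if symbol is None:
            unmapped.append(gene_id)
        else:
            remapped.append(f"{symbol}:{weight}")
    return ",".join(remapped), unmapped


# Made-up rows in the six-column layout (id, chr, start, stop, strand, name).
SAMPLE_GENE_LOC = (
    "1001\t1\t100\t200\t+\tGENE_A\n"
    "1002\t1\t300\t400\t-\tGENE_B\n"
)

mapping = load_id_to_symbol(io.StringIO(SAMPLE_GENE_LOC))
new_gs, unmapped = remap_geneset("1001:2.5,1002:1.0,9999:0.3", mapping)
```

With the real file, the `io.StringIO` sample would be replaced by an opened file handle for <MAGMA_DIR>/NCBI37.3.gene.loc. IDs with no match are returned separately rather than silently dropped, so they can be inspected before the .gs file is rewritten.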

KangchengHou commented 1 year ago

@NaomiHuntley Alternatively, you can modify the NCBI37.3.gene.loc file that was used to run MAGMA, as follows:

# switch the 1st column and the 6th column
awk '{OFS="\t"; print $6,$2,$3,$4,$5,$1}' NCBI37.3.gene.loc > NCBI37.3_symbol.gene.loc

And for MAGMA step1, use the following (note the replaced gene-loc file)

${magma_dir}/magma \
    --annotate window=10,10 \
    --snp-loc ${magma_dir}/g1000_eur.bim \
    --gene-loc ${magma_dir}/NCBI37.3_symbol.gene.loc \
    --out out/step1

This should be more convenient. Please let us know how this works

NaomiHuntley commented 1 year ago

@KangchengHou Thank you so much for that explanation. I am new to all of this so that helped a lot. I just submitted the gene annotation step, which took a few days last time. I will post an update when it works or if there are any more problems.

NaomiHuntley commented 1 year ago

@KangchengHou @martinjzhang Thank you for the help! This code worked well for me and corrected my issue.