Metadata collection - Githubissues

mckellardw / scMuscle2

Curation and analysis of skeletal muscle single-cell data across species

MIT License


Metadata collection #1

Open McKowen-JK opened 5 months ago

McKowen-JK commented 5 months ago

Hi David.

First of all, let me compliment and thank you for your excellent research, code, and documentation.

I am beginning a similar meta-analysis and would like to adapt align_snake to preprocess my data, so I am wondering exactly how the scMuscle2_metadata_v1-0.csv file was constructed.

I am able to pull most of the information (but not all) using GEOquery via GSM number. Some of the columns, such as "include", I assume were entered manually. My plan is to autogenerate as many columns as possible and deal with the remaining columns via brute force.

Please let me know if I am on the right track, or if you're willing to share a simpler way to do this.
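To make that plan concrete, here is a minimal sketch of how the table could be assembled: columns that can be fetched automatically are filled in, and manually curated columns are left blank for later. Apart from "include", the column names and the placeholder sample ID are assumptions for illustration and are not taken from scMuscle2_metadata_v1-0.csv.

```python
import csv
import io

# Hypothetical column names; only "include" is mentioned in this thread.
AUTO_COLUMNS = ["gsm", "title", "organism"]
MANUAL_COLUMNS = ["include"]


def build_metadata_rows(auto_records):
    """Return one row per sample, with manual columns left blank for curation."""
    rows = []
    for record in auto_records:
        # Copy whatever was fetched automatically; missing fields become "".
        row = {col: record.get(col, "") for col in AUTO_COLUMNS}
        for col in MANUAL_COLUMNS:
            row[col] = ""  # to be filled in by hand
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=AUTO_COLUMNS + MANUAL_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # "GSM0000001" is a placeholder, not a real accession.
    fetched = [{"gsm": "GSM0000001", "title": "sample A"}]
    print(rows_to_csv(build_metadata_rows(fetched)))
```

Blank cells then show exactly which fields still need manual entry.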

Much appreciated! -Keller

mckellardw commented 5 months ago

Thanks!

Unfortunately, there are not really standards in place for metadata. And when I started, there weren't any command line tools for querying it, so I actually went through all of the entries in SRA/GEO and manually copied the metadata over to a csv file. I think your strategy sounds much more reasonable!

One thing I wish I had done from the beginning was to use the standards put in place by cellxgene for metadata. The documentation is somewhere on this site and can also be found on their GitHub: https://cellxgene.cziscience.com/docs/01__CellxGene

Also keep in mind that the compute for re-aligning these datasets is substantial; it could take a few weeks to run, depending on what resources you have.

Best of luck!

McKowen-JK commented 5 months ago

Hi David,

Thank you for the reply and the advice. Now I know I'm not just wasting time on this. I am having better success with the geofetch tool, which lets you get metadata from GEO and SRA. I will share what I come up with once I am satisfied.

I will look into the cellxgene documentation: https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data#dataset-requirements https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md

My question is: how will changing or adding to cellxgene's requirements affect the align_snake script? I suppose I can add new columns and incorporate them later in a scanpy pipeline. I agree that cellxgene compatibility is worth the effort.

-Best

mckellardw commented 5 months ago

The only section of the snakemake workflow where I remember the metadata being used is in the Snakefile under the Metadata and Pre-snake metadata prep sections. Just make sure that the column names that are used to index the dataframe match.
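A quick way to check that the column names match is to compare the CSV header against the names the workflow indexes on. This is only a sketch: the `required` list below is hypothetical, and the real names should be read from the Snakefile.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical header and required names, for illustration only.
    example = "gsm,include\nGSM0000001,True\n"
    print(missing_columns(example, ["gsm", "include", "species"]))
```

Running a check like this before launching the workflow surfaces a renamed or missing column early rather than partway through a long alignment run.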

There may be other places in downstream sections (you should be able to search the codebase for "META", which is the variable in the workflow containing all of the manually curated metadata). Happy to help if you keep posting questions.
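If grep is not handy, the search for "META" could be sketched in Python as below. The file suffixes and the `Snakefile` name filter are assumptions about how the repo is laid out, so adjust them as needed.

```python
import re
from pathlib import Path

# Match META as a whole word, so METADATA and META_x are not reported.
META_PATTERN = re.compile(r"\bMETA\b")


def find_meta_usages(root, suffixes=(".smk", ".py"), names=("Snakefile",)):
    """Return (relative path, line number, line) for each line that uses META."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix not in suffixes and path.name not in names:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if META_PATTERN.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits


if __name__ == "__main__":
    for rel_path, number, line in find_meta_usages("."):
        print(f"{rel_path}:{number}: {line}")
```

Each hit is a place where a renamed or added metadata column might need a matching change.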

Also, if you would like to fork/clone the align_snake pipeline as its own repo, I am happy to host that, and potentially contribute some updates for the workflow.

Best of luck, David