Generation of new 1000g data

h3abionet / HPCBio-Refgraph_pipeline


Generation of new 1000g data #9

Closed cjfields closed 3 years ago

grendon commented 3 years ago

Yoruba files: metadata and data provenance. See attachments.

data_provenance_Yoruba.txt

metadata_Yoruba.txt

A local mirror of the Yoruba CRAM files is available on biocluster at this location:

/home/groups/h3abionet/RefGraph/data/1000genome/Yoruba/

grendon commented 3 years ago

Q: Which resources should we use to run the pipeline on biocluster? queue=normal account=h3bionet

cjfields commented 3 years ago

@grendon half the cores are open on HPCBio, maybe start it there and see how long the first batch of samples takes (run the report). We can estimate costs from that.

grendon commented 3 years ago

Samples have been divided into these groups: ESN, GWD, LWK, MSL, Yoruba. The first four groups in that list have fewer than 30 samples each. The Yoruba group has 99 samples. The pipeline is being run on each group separately, and the results are being arranged into a separate folder for each group.
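For anyone reproducing the per-group layout described above, here is a minimal sketch of one way to bucket samples by group and derive a per-group results folder. The sample IDs, the `results` base directory, and both function names are hypothetical placeholders, not the ones used by the pipeline.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_samples(sample_to_group):
    """Return a dict mapping each group name to a sorted list of its sample IDs."""
    groups = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)
    return {group: sorted(members) for group, members in groups.items()}


def result_dir(base, group):
    """Build the per-group results folder path (hypothetical layout: <base>/<group>)."""
    return PurePosixPath(base) / group


if __name__ == "__main__":
    # Hypothetical sample IDs; the real ones come from the metadata files.
    samples = {"sampleA": "ESN", "sampleB": "Yoruba", "sampleC": "ESN"}
    for group, members in sorted(group_samples(samples).items()):
        print(group, result_dir("results", group), members)
```

Keeping the grouping in a plain dictionary makes it easy to launch one pipeline run per group and to check the group sizes before submitting jobs.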

cjfields commented 3 years ago

@grbot we have data ready to go; transferring the data for the other hackathon participants can be assigned as a task if needed.

grbot commented 3 years ago

@cjfields I see Gloria has an account on Ilifu. The name of the Ilifu GO endpoint is "Ilifu DTN", and she can use the credentials that were sent in her welcome email. Can we ask her to transfer the files to /cbio/projects/012/stream1/hupan/1kg-100-samples/uiuc, please? What is the size?

grendon commented 3 years ago

The size of the transfer depends on what file(s) are needed on your end for analysis.

We analyzed almost 200 samples. The final results for each sample are the megahit assembly files + corresponding metrics. Those files are tiny, only a few KBases.

Intermediate results include the unmapped reads before and after QC trimming. Are those files needed too? They are larger than the output files, roughly 1/10 of the size of the input file.
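As a rough way to size the transfer of the intermediate files, the 1/10 rule of thumb above can be applied to the input file sizes. This is only a back-of-the-envelope sketch: the input sizes below are made-up placeholders, the function names are hypothetical, and the 0.1 ratio is the approximate figure quoted in this comment, not a measured value.

```python
def estimate_intermediate_bytes(input_sizes_bytes, ratio=0.1):
    """Estimate total intermediate-file size as a fixed fraction of the input sizes."""
    return round(sum(input_sizes_bytes) * ratio)


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


if __name__ == "__main__":
    # Placeholder input sizes in bytes; substitute the real CRAM file sizes.
    input_sizes = [10_000_000_000, 20_000_000_000]
    total = estimate_intermediate_bytes(input_sizes)
    print(f"Estimated intermediate data: {total} bytes ({to_gib(total):.2f} GiB)")
```

In practice the real input sizes could be collected with `os.path.getsize` over the local mirror and passed in as the list.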

We also ran MultiQC on each group of samples. The resulting report is also small.

grbot commented 3 years ago

Thank you @grendon

Please transfer everything you have, because it might be useful in the comparison. The 1000 Genomes samples that we worked with are here. If you have those and can transfer them, that would be great.

grendon commented 3 years ago

done