microbiomedata / pilot

Updating to nmdc-json_2019_02_19.zip #40

Status: Closed (jeffbaumes closed 4 years ago)

jeffbaumes commented 4 years ago

This data update includes EMSL data.

kfagnan commented 4 years ago

Nice!

Though the total size of the datasets went down from 30 TB to 6 TB - is that an issue with the units?

On Wed, Feb 19, 2020 at 7:41 PM Jeff Baumes notifications@github.com wrote:

Merged #40 https://github.com/microbiomedata/pilot/pull/40 into master.

jeffbaumes commented 4 years ago

I originally used a rawer TSV file from @dehays for data object information. In the last week he filtered and curated the data objects to include just three specific files for each GOLD project, and this is what @wdduncan now includes in the official JSON data drops. It was after that process that the total dataset size went down. I believe the units are correct, and I'm fairly sure the size total went up with the EMSL data inclusion, so those data sizes are represented as well.

kfagnan commented 4 years ago

Yes, sorry. I chatted with Emiley after sending this and she mentioned the reduction in data that were included in the pilot.

dehays commented 4 years ago

@kfagnan I was curious about this - details here: https://docs.google.com/spreadsheets/d/1nFyzFsbFaVClApOyDVVAp4uXuzyufxdRM0LqFpHitpE/edit

As @jeffbaumes indicated - the first JAMO file set I produced had 12586 files because it included a large number of annotation files. The sum of those file sizes was 35.4 TB.

Retrieving only fastq, fna and faa produced a set of 3004 files with a total size of 5.3 TB. The EMSL files only add another 1.9 TB.
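For anyone who wants to repeat this kind of check, here is a minimal sketch of filtering a file listing down to fastq, fna and faa and summing the sizes. It is not the query that was actually run against JAMO: the `(name, size_in_bytes)` record shape, the helper names, and the handling of `.gz`/`.bz2` suffixes are all assumptions made for illustration, and sizes are reported in decimal terabytes.

```python
from pathlib import PurePosixPath

# File types named in the comment above; the set name is hypothetical.
KEPT_EXTENSIONS = {".fastq", ".fna", ".faa"}
# Assumed compression suffixes to ignore when deciding the file type.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def is_kept(file_name):
    """Return True if the file is a fastq, fna or faa file (optionally compressed)."""
    suffixes = [s.lower() for s in PurePosixPath(file_name).suffixes]
    # Strip trailing compression suffixes so "reads.fastq.gz" counts as fastq.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    return bool(suffixes) and suffixes[-1] in KEPT_EXTENSIONS


def summarize(records):
    """Given (name, size_in_bytes) pairs, return (kept file count, total size in TB)."""
    kept = [(name, size) for name, size in records if is_kept(name)]
    total_bytes = sum(size for _, size in kept)
    return len(kept), total_bytes / 1e12  # decimal terabytes
```

Calling `summarize` once on the full listing and once on the filtered listing would give the before/after file counts and totals side by side.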

Other things I looked into - why are there a total of 873 MetaG and MetaT, but 913 projects processed by JGI? The answer is the 40 isolate genomes produced on the Wrighton study. The inclusion of isolate genomes may also be part of the reason we see more fastq, fna, and faa files than the 873 each that would be expected.
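The counts above can be cross-checked with a few lines of arithmetic. This sketch assumes every project contributes exactly one fastq, one fna and one faa file, which the thread does not confirm; the variable names are made up for illustration.

```python
META_PROJECTS = 873   # MetaG and MetaT projects
JGI_PROJECTS = 913    # all projects processed by JGI
KEPT_FILES = 3004     # fastq + fna + faa files retrieved
FILE_TYPES = 3        # fastq, fna, faa

# Projects that are neither MetaG nor MetaT (the isolate genomes).
isolates = JGI_PROJECTS - META_PROJECTS

# Expected file counts if each project had exactly one file of each type.
expected_meta_only = FILE_TYPES * META_PROJECTS
expected_with_isolates = FILE_TYPES * JGI_PROJECTS

# Files still unaccounted for after including the isolates.
unexplained = KEPT_FILES - expected_with_isolates

print(isolates, expected_meta_only, expected_with_isolates, unexplained)
```

Under that one-file-per-type assumption the isolates account for 120 of the 385 files above the 2619 expected from MetaG and MetaT alone, which is consistent with isolates being only part of the reason.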

kfagnan commented 4 years ago

Ah this makes sense, thank you!

We should figure out the NMDC stance on isolates. I believe these are meant to be filtered out, though this gives me the chance to chat with collaborators about it next week.
