luzhang321 commented 3 years ago

Hi :) Recently I have been using sampleMetadata (curatedMetagenomicData 3.0.1). This metadata is really helping a lot.

1. Since it is already manually curated, can I directly use the collected metadata for my research?
2. I am confused about this dataset: CosteaPI_2017.

The corresponding paper is "Subspecies in the global human gut microbiome".

2.1 In the paper, they report "298 newly generated ones (Table EV1)". I downloaded Table EV1 and filtered it for PRJEB17632 and German samples, which returns 111 samples. However, when I extract samples with the same criteria from sampleMetadata, I get 107 samples. I am confused by the different sample numbers.

CosteaPI_2017 <- filter(sampleMetadata, study_name == "CosteaPI_2017") %>% filter(country == "DEU") # 31 individuals, 107 samples

CosteaPI_2017_Stable1 <- readxl::read_xlsx("CosteaPI_2017/msb177589-sup-0003-tableev1.xlsx", sheet = "Sheet1")
CosteaPI_2017_Stable1_DEU <- filter(CosteaPI_2017_Stable1, ENA_study_accession == "PRJEB17632") %>%
  filter(., `Cohort country` == "German") # 111 samples

2.2 Also, their supplementary table does not state that these individuals are healthy. Why is the disease column in sampleMetadata "healthy" for these samples? Is that because you have other sources for curation?

Thank you! It would be great if you could give some suggestions.

Best Regards, Lu

lwaldron commented 3 years ago

@paolinomanghi can you check into this?

paolinomanghi commented 3 years ago

Hi @luzhang321, thanks for getting in touch!

I'm glad the metadata help your work!

Yes: using the curated metadata should avoid the problems that may arise from curating them yourself. Besides, the metadata in curatedMetagenomicData have been published several times.
I'm sorry if some of the numbers do not match those in the publication. I also mean to point this out clearly in the next paper. Some of the metadata might not be linkable with enough precision to their metagenomic samples, and this is one of the reasons why the numbers differ from those in the paper. It is also common for a number of samples to yield an empty taxonomic profile. Since, for this release, I focused on giving a cross-section of the current state of metagenomic sequencing, I favoured the presence of a taxonomic profile over strict agreement with the original numbers. I hope this doesn't cause too much trouble.
We apply the healthy flag either when the samples are healthy or when the source allows us to assume the absence of serious diseases. In the past we used either healthy or none, but we opted for a simpler encoding. I understand this introduces a potential source of error and further confusion. Still, I think this is a reasonable approximation for the diseases that are most represented in metagenomics. Of course, this is not true for less severe diseases that are less represented (e.g. asthma), or for those that are still not represented (depression, autism).
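
To see exactly which samples account for a count difference such as 111 versus 107, the two sets of sample identifiers can be compared directly. The following is only a minimal sketch in base R with toy data: the data frames `paper_table` and `curated`, their `sample_id` column, and the identifiers `s1` to `s4` are hypothetical stand-ins for the paper's supplementary table and the filtered sampleMetadata, whose real identifier columns may be named differently.

```r
# Hypothetical stand-in for the paper's supplementary table
paper_table <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the filtered sampleMetadata
curated <- data.frame(
  sample_id = c("s1", "s2", "s4"),
  stringsAsFactors = FALSE
)

# Samples reported in the paper but absent from the curated metadata
missing_from_curated <- setdiff(paper_table$sample_id, curated$sample_id)

# Samples present in the curated metadata but not listed in the paper
extra_in_curated <- setdiff(curated$sample_id, paper_table$sample_id)

print(missing_from_curated)
print(extra_in_curated)
```

Samples listed in `missing_from_curated` would be the candidates for having been dropped, for example because they could not be linked to a metagenome or had an empty profile.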

I hope this helps. Thanks for asking, and keep us posted.

I'm closing the issue

luzhang321 commented 3 years ago

Hi @paolinomanghi, thanks for your reply, it's very useful! :)

I hope I'm not bothering you, but I still have two more questions.

1. dataset: FerrettiP_2018


FerrettiP_2018 <- filter(sampleMetadata, study_name %in% "FerrettiP_2018") %>% 
filter(body_site == "stool" & age_category %in% c("adult", "senior"))

FerrettiP_2018_ena <- read_tsv("FerrettiP_2018/filereport_read_run_PRJNA352475_tsv.txt") %>% # tsv is the file downloaded from ENA
  dplyr::filter(., !grepl("infant", experiment_title)) %>% # filter out infant samples, keep stool
  dplyr::filter(., grepl("stool", experiment_title))

setdiff(FerrettiP_2018$sample_id, FerrettiP_2018_ena$sample_alias)

[1] "CA_C10019IS2318FE_t1M15" "CA_C10039MS2669SA_t0M15"

2 more samples are marked as adult stool samples in the sampleMetadata file.



I compared the sample names with the ENA file (https://www.ebi.ac.uk/ena/browser/view/PRJNA352475?show=reads):
"CA_C10019IS2318FE_t1M15": in the ENA file it is described as infant stool, but in sampleMetadata it is adult stool
##### Illumina HiSeq 2500 sequencing; human metagenome: infant stool    Mother-infant microbiome vertical transmission  PRJNA352475 2318_t1 CA_C10019IS2318FE_t1M15.fastq.bz2 
"CA_C10039MS2669SA_t0M15": this sample is also described as infant stool in ENA, but in sampleMetadata it is adult stool
##### Illumina HiSeq 2500 sequencing; human metagenome: infant stool    Mother-infant microbiome vertical transmission  PRJNA352475 2669_t0 CA_C10039MS2669SA_t0M15.fastq.bz2

2. dataset: IjazUZ_2017
I understand some samples are missing because they might not have taxonomic profiles. But there is one sample, "S119_a_WGS". In the paper's Table S1, it is recorded as a healthy child.
![image](https://user-images.githubusercontent.com/46122020/129489493-c92f62c5-87de-4700-a908-ae53225e67d9.png)
In sampleMetadata, it is recorded as an adult:
![image](https://user-images.githubusercontent.com/46122020/129489511-84c4567c-6829-4f75-8045-3da3a9680b27.png)
Can I remove this sample?

Looking forward to your reply. Thanks in advance.
Best Regards,
Lu

paolinomanghi commented 3 years ago

Hi @luzhang321, thanks again for getting in touch.

Let's start with the Ferretti dataset: this is a dataset collected by our lab. I wasn't here yet when the work was done, but I arrived just in time to add it to the cMD package. So, keeping in mind that errors are always possible and that I didn't carry out all the metadata collection myself, here's what I did:

1. I double-checked the sources I received from the first author about the metadata. I consider those more recent than the ENA submission, and in my experience more recent sources of metadata are often more accurate. Besides, the main metadata are often created and compiled by the main authors, while ENA or NCBI submissions are often completed by co-authors with computational skills. In the raw sources I received from the first author, these samples were adult (stool).
2. I counted the number of species in these two samples (61 and 107), and I think that, given the average sequencing of six years ago, these are typical species counts for adult samples (of course, this is my subjective evaluation). So, from the actual numbers, I would trust the metadata.
3. I looked at the taxonomic profiles: the one with 107 species (CA_C10019IS2318FE_t1M15) is a typical healthy, female stool profile, with a lot of Alistipes, Akkermansia and Gordonibacter, and I would never doubt that it is a regular, healthy adult gut microbiome.
4. Looking at the second one (CA_C10039MS2669SA_t0M15, 61 species), I'm in some doubt: to me (again, a personal impression) it looks like a typical oral microbiome, although it is labelled as stool. So I need some more time to check this.
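
For reference, a species count per sample like the ones mentioned above can be obtained by counting the non-zero entries in each column of a species-by-sample abundance table. This is only a sketch with made-up numbers: the matrix `profiles`, the species names `sp_A` to `sp_C` and the sample names are hypothetical, and the real profiles would of course be much larger.

```r
# Toy relative-abundance matrix: rows are species, columns are samples
# (all names and values are hypothetical)
profiles <- matrix(
  c(10, 0, 5,
     0, 0, 2,
     3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("sp_A", "sp_B", "sp_C"),
    c("sample_1", "sample_2", "sample_3")
  )
)

# A species counts as detected in a sample when its abundance is above zero
species_per_sample <- colSums(profiles > 0)

print(species_per_sample)
```

Whether a given count is plausible for an adult stool sample remains a judgment call that depends on sequencing depth and on the profiling method.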

Concerning the Ijaz dataset: I confirm that the one you found is an error. I'll upload the corrected table now, and the package should start applying it in a few weeks or so. In the meantime, I suggest you manually annotate that sample as "child". I'm very sorry for the inconvenience.
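
Until the corrected table is released, a manual annotation could look roughly like the sketch below. It uses base R on a small toy data frame called `meta` rather than the real sampleMetadata; the column names `sample_id` and `age_category` are the ones used earlier in this thread, while the second row (`other_sample`) is a hypothetical placeholder.

```r
# Toy stand-in for the IjazUZ_2017 rows of sampleMetadata
meta <- data.frame(
  sample_id    = c("S119_a_WGS", "other_sample"),
  age_category = c("adult", "adult"),
  stringsAsFactors = FALSE
)

# Locate the sample to correct and make sure it matches exactly one row
is_target <- meta$sample_id == "S119_a_WGS"
stopifnot(sum(is_target) == 1)

# Overwrite only that sample's age category
meta$age_category[is_target] <- "child"

print(meta)
```

Checking that exactly one row matches guards against silently changing nothing (or too much) if the identifier is spelled differently in a later release.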

Bests, Paolo

luzhang321 commented 3 years ago

Hi @paolinomanghi, thank you for your reply! It's very helpful!

lwaldron commented 3 years ago

There is still an open issue here, until this change has been merged into sampleMetadata:

> Concerning the problem of Ijaz dataset: I confirm that the one you found is an error. I'll upload the corrected table now, and the package should start applying it in a few weeks or so.

schifferl commented 3 years ago

As this is a curation issue, rather than one relating to the software, I am transferring this issue to the curation repository. I've just attempted to pull in updates from the curation repo, but there are none currently. If changes are ready by tomorrow morning (EST) / afternoon (CET), sampleMetadata will be updated automatically.

lwaldron commented 3 years ago

But curatedMetagenomicData contains software + data, and the sampleMetadata object is a part of curatedMetagenomicData, and not the curation repo. Not having an open issue on curatedMetagenomicData has the effect of obscuring an open issue from users of the package, and removes the public check and communication that this fix is actually merged into curatedMetagenomicData, defeating a couple of the main purposes of open issue tracking. Cross-listing in this instance would be acceptable since the change must be made in two repositories, but I can't understand the rationale for not wanting an open issue on curatedMetagenomicData.

lwaldron commented 2 years ago

@paolinomanghi it looks like there are two outstanding issues here:

- [ ] Looking at the second one (CA_C10039MS2669SA_t0M15, 61 species), I'm a bit in doubt: to me (again, personal impression) it looks like a typical oral microbiome although it is labelled as stool. So, I need some more time to check this.
- [ ] Concerning the problem of Ijaz dataset: I confirm that the one you found is an error. I'll upload the corrected table now, and the package should start applying it in a few weeks or so. In the meantime, I suggest you manually annotate that sample as "child", and I'm very sorry for the inconvenience.

waldronlab / curatedMetagenomicDataCuration

CosteaPI_2017 disease status, healthy? #57
