Missing Countries in four packages

stschiff commented 2 years ago

The following four packages have missing Country entries:

trident list --individuals -d . -j Country --raw | awk '$4 == "n/a"' | cut -f1 | sort | uniq -c
   4 2014_RaghavanScience
  20 2020_Nakatsuka_SouthPatagonia
  40 2021_Kilinc_northeastAsia
 383 2021_Wang_EastAsia
   4 Reference_Genomes

Obviously, the last one should have n/a, but the others should have proper Countries. Should be easy to fix by checking the original papers. @dhananjaya93 (@93Boy) perhaps you could get to that. Thanks.
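
For anyone without trident at hand, the same tally can be approximated by reading the .janno files directly. This is a minimal sketch, not part of the Poseidon tooling: it assumes each package sits in its own directory with a tab-separated .janno file that has a Country column, and it uses the directory name as the package name. The function name count_missing_country is made up for this example.

```python
import csv
from collections import Counter
from pathlib import Path


def count_missing_country(root):
    """Count rows per package whose Country field is empty or 'n/a'.

    Assumption: the directory that holds a .janno file is named after the package.
    """
    counts = Counter()
    for janno in sorted(Path(root).rglob("*.janno")):
        with open(janno, newline="", encoding="utf-8") as handle:
            # .janno files are tab-separated with a header row
            for row in csv.DictReader(handle, delimiter="\t"):
                country = (row.get("Country") or "").strip()
                if country in ("", "n/a"):
                    counts[janno.parent.name] += 1
    return counts


if __name__ == "__main__":
    # Print in the same "count package" layout as `uniq -c`
    for package, n in sorted(count_missing_country(".").items()):
        print(f"{n:4d} {package}")
```

Unlike the trident pipeline, this also counts a missing Country column or an empty cell as missing, which may or may not be what you want.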

93Boy commented 2 years ago

I will look into this

93Boy commented 2 years ago

2014_RaghavanScience was updated through #99 and 2020_Nakatsuka_SouthPatagonia was updated through #100

93Boy commented 2 years ago

The 383 individuals of 2021_Wang_EastAsia are not listed in the supplementary documents. They list only 169 newly reported ancient samples, but Poseidon already has 191 samples with complete information, in addition to the 383 samples mentioned above. @AyGhal, can you give me any hints on this?

nevrome commented 1 year ago

@AyGhal and I looked into this.

2021_Kilinc_northeastAsia is a bare-bones package with almost no information in the .janno file. So we should add information well beyond just the Country. This information could be extracted either from the paper supplement or from the AADR.

The same is true for the modern samples in 2021_Wang_EastAsia. Information for these modern samples can be found in the HO version of the AADR dataset here.

93Boy commented 12 months ago

I have gone through the AADR dataset mentioned above, and "2021_Kilinc_northeastAsia" has only 2 entries in AADR. Of those 2 entries, only "N2a" has a matching PoseidonID. 2021_Wang_EastAsia, however, has data for almost all the modern samples. I will upload the data.

AyGhal commented 12 months ago

All the individuals for "2021_Kilinc_northeastAsia" should be in AADR. Try looking for the publication "KilincSciAdv2021". They have added "_noUDG.SG" to the IDs.
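
A minimal sketch of that lookup, assuming the only difference between the two ID sets is the "_noUDG.SG" suffix quoted above. The helper names and the example IDs in the usage note are illustrative; whether every Poseidon_ID in the package follows this pattern is an assumption to verify against the data.

```python
AADR_SUFFIX = "_noUDG.SG"  # suffix quoted in this thread; other suffixes are not handled


def strip_suffix(aadr_id, suffix=AADR_SUFFIX):
    """Return the AADR ID without the trailing suffix, if it has one."""
    return aadr_id[: -len(suffix)] if aadr_id.endswith(suffix) else aadr_id


def match_ids(poseidon_ids, aadr_ids):
    """Map each Poseidon_ID to the AADR ID that equals it once the suffix is removed.

    IDs without a counterpart map to None so they can be checked by hand.
    """
    by_base = {strip_suffix(aadr_id): aadr_id for aadr_id in aadr_ids}
    return {pid: by_base.get(pid) for pid in poseidon_ids}
```

For example, match_ids(["brn008", "N2a"], ["brn008_noUDG.SG", "N2a"]) pairs "brn008" with "brn008_noUDG.SG" and "N2a" with itself.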

93Boy commented 12 months ago

Got the information. I missed those entries because they were categorized under 2018 data instead of 2021 in AADR.

93Boy commented 11 months ago

I have partially added the information for KilincSciAdv2021 via PR #147, but I encountered some confusing points while curating the AADR data. I hope you can help me clear them up:

AADR has two Y-haplogroup fields: Y haplogroup (manual curation in terminal mutation format) and Y haplogroup (manual curation in ISOGG format). According to Google, the latter is more reliable. Which one should I use?
The method for determining the date is given as Direct: IntCal20, but the mean and SD of the dates look suspicious.
There are numerous library types in a single entry, e.g. ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus. Is this normal?

stschiff commented 11 months ago

Thanks, @93Boy. Some replies:

Re Y-haplogroups. This is what the schema says: "please follow syntax with main branch + most terminal derived Y-SNP (e.g. R1b-P312)". Can someone advise whether that is actually ISOGG format? @AyGhal @TCLamnidis ?
I don't understand your Date question. I think that simply means the date type should be "direct", right @nevrome ?
Re libraries. Poseidon Schema to the rescue. As you can see here, Library_Built is a list-field and allows multiple entries, which should be consistent with Nr_Libraries. Please separate by semi-colon.

AyGhal commented 11 months ago

If you get the janno info from AADR_v54_1_p1_1240K_BeyondAncient-0.1.2, @nevrome has already converted it to our format. AADR_Y_Haplogroup_ISOGG is there. Library_Built and Nr_Libraries are there as well, along with the original AADR_Library_Type.

nevrome commented 11 months ago

What @AyGhal says.

The aadr-archive should already have everything, converted according to my decisions, with the code available here. Please note the .csv file I compiled with a summary of the anno2janno mapping here.

So to answer the concrete questions:

Y haplogroup (manual curation in terminal mutation format) is the one that fits our requirements. To my understanding, this is not the ISOGG format.
IntCal20 is the calibration curve. We don't have a column for this information. See the old age string parser script here for how I extract the age information from the AADR.
What the AADR summarises as library types is split across two columns in the .janno file: UDG and Library_Built. See the code to pull the information apart here.
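
To illustrate the last point: a rough sketch of pulling an AADR library-type string apart, assuming every comma-separated token has the form "<build>.<udg>" as in the examples in this thread (ds.minus, ds.plus). This does not reproduce the linked conversion code, and the function name split_library_types is made up here. How the resulting lists are written into the .janno columns depends on the schema.

```python
def split_library_types(aadr_library_type):
    """Split e.g. 'ds.plus,ds.minus' into per-library build and UDG lists.

    Assumption: each token is '<build>.<udg>'; a token without a dot
    yields an empty UDG entry instead of raising an error.
    """
    builds, udgs = [], []
    for token in aadr_library_type.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty pieces such as a trailing comma
        build, _, udg = token.partition(".")
        builds.append(build)
        udgs.append(udg)
    return builds, udgs


if __name__ == "__main__":
    builds, udgs = split_library_types("ds.plus,ds.plus,ds.plus,ds.minus,ds.minus")
    # One entry per library; the list length gives the number of libraries
    print(len(builds), ";".join(builds), ";".join(udgs))
```

Keeping one entry per library preserves the information needed to decide later whether a sample has mixed UDG treatment or a single one.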

If this all makes sense to you and you do not see any mistakes in my code, then you can probably just copy the info from the respective aadr-archive packages, @93Boy.

93Boy commented 11 months ago

Y haplogroup (manual curation in terminal mutation format) is almost empty or has no meaningful data in AADR, but the manual curation in ISOGG format has values. May I use those data? My concern about the UDG and library type data is that a single genetic_ID contains information for multiple libraries, e.g. brn008_noUDG.SG: ds.plus,ds.plus,ds.plus,ds.minus,ds.minus. I have not seen this kind of pattern in previous Poseidon data.

stschiff commented 11 months ago

As discussed in a meeting, list data for libraries is supported by the schema. But it's not necessary to take this over from AADR for now. We are just keen to get the Country data and other missing data in for now.

poseidon-framework / community-archive

Missing Countries in four packages #96