Open stschiff opened 2 years ago
I will look into this
2014_RaghavanScience was updated through #99 and 2020_Nakatsuka_SouthPatagonia was updated through #100.
383 individuals of 2021_Wang_EastAsia are not listed in the supplementary documents. The paper has only 169 newly reported ancient samples, but Poseidon already has 191 samples with complete information, in addition to the 383 samples mentioned above. @AyGhal, can you give me any hint regarding this?
@AyGhal and I looked into this.
2021_Kilinc_northeastAsia is a bare-bones package with almost no information in the .janno file. So we should add information way beyond just the Country. This information could be extracted either from the paper supplement or from the AADR.
The same is true for the modern samples in 2021_Wang_EastAsia. Information for these modern ones can be found in the HO version of the AADR dataset here.
I went through the AADR dataset mentioned above, and "2021_Kilinc_northeastAsia" has only 2 entries in AADR. Of those 2 entries, only "N2a" has a matching Poseidon_ID. For 2021_Wang_EastAsia, however, AADR has data for almost all the modern samples. I will upload the data.
All the individuals for "2021_Kilinc_northeastAsia" should be in AADR. Try looking for the publication "KilincSciAdv2021". They have added "_noUDG.SG" to the IDs.
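The suffix issue described above can be handled with a small matching helper. This is only a sketch: the function names are hypothetical, the suffix list contains just the one suffix mentioned in this thread, and the second AADR ID is made up for illustration.

```python
def strip_aadr_suffix(genetic_id, suffixes=("_noUDG.SG",)):
    """Return the ID without a known AADR suffix.

    The suffix tuple is an assumption; extend it if AADR uses other suffixes.
    """
    for suffix in suffixes:
        if genetic_id.endswith(suffix):
            return genetic_id[: -len(suffix)]
    return genetic_id


def match_ids(poseidon_ids, aadr_ids):
    """Map each Poseidon ID to the AADR ID that reduces to it, or None."""
    by_base = {strip_aadr_suffix(a): a for a in aadr_ids}
    return {p: by_base.get(p) for p in poseidon_ids}


# "brn008_noUDG.SG" appears in this thread; "zzz999" is a made-up ID.
matches = match_ids(["brn008", "zzz999"], ["brn008_noUDG.SG", "abc001_noUDG.SG"])
print(matches)
```

IDs that map to None are the ones that still need a manual look in the AADR table.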
Got the information. I missed those entries because they are categorized under 2018 data instead of 2021 in AADR.
I have partially added information for KilincSciAdv2021 via PR #147, but I encountered some confusing points while curating the AADR data. I hope you can help me clear them up:
Y haplogroup (manual curation in terminal mutation format) and Y haplogroup (manual curation in ISOGG format)
while the latter is more reliable according to Google. Which one should I use?
ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus
Is this a normal situation?
Thanks, @93Boy. Some replies:
Library_Built is a list-field and allows multiple entries, which should be consistent with Nr_Libraries. Please separate by semi-colon.
If you get the janno info from AADR_v54_1_p1_1240K_BeyondAncient-0.1.2, @nevrome has already converted it to our format. AADR_Y_Haplogroup_ISOGG is there. Also there are Library_Built and Nr_Libraries and that is the original AADR_Library_Type.
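One possible reading of "consistent with Nr_Libraries" is that the number of semicolon-separated entries in Library_Built equals Nr_Libraries. The check below is a sketch under that assumption; the function name is hypothetical and this is not part of any Poseidon tool.

```python
def library_fields_consistent(library_built, nr_libraries):
    """Check that a semicolon-separated Library_Built value has Nr_Libraries entries.

    Assumption: "consistent" means the entry count equals Nr_Libraries.
    """
    entries = [e.strip() for e in library_built.split(";") if e.strip()]
    return len(entries) == nr_libraries


print(library_fields_consistent("ds;ds;ds", 3))
print(library_fields_consistent("ds;ds", 3))
```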
What @AyGhal says.
The aadr-archive should already have everything according to my decisions with the code available here. Please note the .csv file I compiled with a summary of the anno2janno mapping here.
So to answer the concrete questions:
Y haplogroup (manual curation in terminal mutation format) is the one that fits our requirements. To my understanding this is not the ISOGG format.
IntCal20 is the calibration curve. We don't have a column for this information. See the old age string parser script here for how I extract the age information from the AADR.
UDG and Library_Built. See the code to pull the information apart here.
If this all makes sense to you and you do not see any mistake in my code, then you can probably just copy the info from the respective aadr-archive packages, @93Boy.
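For readers without access to the linked code, here is a rough sketch of how an AADR library-type string could be pulled apart. It is not the code referred to above. It assumes each comma-separated token has the form `<strandedness>.<UDG treatment>` (for example `ds.minus`), and the function name is hypothetical.

```python
def split_library_type(aadr_value):
    """Split an AADR library-type string into parallel lists.

    Assumption: every token looks like "<strandedness>.<UDG treatment>".
    Returns (built, udg), one entry per library.
    """
    built, udg = [], []
    for token in aadr_value.split(","):
        token = token.strip()
        if not token:
            continue
        strand, _, treatment = token.partition(".")
        built.append(strand)
        udg.append(treatment)
    return built, udg


# Example value taken from this thread.
built, udg = split_library_type("ds.plus,ds.plus,ds.plus,ds.minus,ds.minus")
library_built = ";".join(built)   # semicolon-separated list field
nr_libraries = len(built)         # one entry per library
udg_values = sorted(set(udg))     # distinct UDG treatments seen
print(library_built, nr_libraries, udg_values)
```

How several distinct UDG treatments should be recorded for a single individual is a curation decision and is left open here.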
Y haplogroup (manual curation in terminal mutation format) is almost empty or doesn't have meaningful data in AADR, but manual curation in ISOGG format has values. May I use these data?
My concern about the UDG and library type data is that a single genetic_ID contains information on multiple libraries. E.g. brn008_noUDG.SG has ds.plus,ds.plus,ds.plus,ds.minus,ds.minus. I have not seen this kind of pattern in previous Poseidon data.
As discussed in a meeting, list data for libraries is supported by the schema. But it's not necessary to take this over from AADR for now. We are just keen to get the Country data and other missing data in for now.
The following four packages have missing Country entries:
Obviously, the last one should have n/a, but the others should have proper Countries. Should be easy to fix by checking the original papers. @dhananjaya93 (@93Boy) perhaps you could get to that. Thanks.
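To find the affected rows, a .janno file can be scanned for empty or n/a Country values. The sketch below assumes a tab-separated file with Poseidon_ID and Country columns. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def missing_country(janno_text, allow_na=False):
    """Return the Poseidon_IDs whose Country is empty (or n/a unless allowed)."""
    reader = csv.DictReader(io.StringIO(janno_text), delimiter="\t")
    missing = []
    for row in reader:
        country = (row.get("Country") or "").strip()
        if country == "" or (country == "n/a" and not allow_na):
            missing.append(row["Poseidon_ID"])
    return missing


# Made-up sample rows, not taken from any real package.
sample = (
    "Poseidon_ID\tGenetic_Sex\tGroup_Name\tCountry\n"
    "ind1\tM\tGroupA\tRussia\n"
    "ind2\tF\tGroupA\tn/a\n"
    "ind3\tU\tGroupB\t\n"
)
print(missing_country(sample))
print(missing_country(sample, allow_na=True))
```

For a package where n/a is the intended value, pass `allow_na=True` so that only truly empty cells are reported.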