ropensci / c14bazAAR

R Package - Download and Prepare C14 Dates from Different Source Databases
https://docs.ropensci.org/c14bazAAR
GNU General Public License v2.0

Non-radiocarbon dates: the bazAARverse #91

Open nevrome opened 4 years ago

nevrome commented 4 years ago

@yesdavid I saw that the 14cpalaeolithic database you added contains non-radiocarbon dates. c14bazAAR does not support these yet: we can't simply put them into the fields c14age and c14std, as that wouldn't make any sense.

For the moment I decided to remove them:

# keep only the radiocarbon determinations for now
c14palaeolithic <- c14palaeolithic %>%
    dplyr::filter(method %in% c("AMS", "Conv. 14C"))

I think we could make this possible in the future, but this will require some changes in the architecture of the whole package.

Maybe the easiest solution would be dedicated S3 classes for each dating method: e.g. uratho_date_list, osl_date_list, dendro_date_list. Some functions developed for the c14_date_list could be applied to these objects as well; others could not. There should be a superclass that allows merging these dates despite their semantic differences.
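A minimal base-R sketch of this idea (all class, function, and column names here are hypothetical, not part of c14bazAAR): each method gets its own S3 class, every object also inherits from a shared superclass, and generics dispatch accordingly.

```r
# Hypothetical sketch of the proposed class hierarchy: every
# method-specific class also inherits from a shared superclass
# so that generic helpers work across all dating methods.

new_date_list <- function(x, method_class) {
  stopifnot(is.data.frame(x))
  structure(x, class = c(method_class, "date_list", class(x)))
}

as.c14_date_list <- function(x) new_date_list(x, "c14_date_list")
as.osl_date_list <- function(x) new_date_list(x, "osl_date_list")

# a generic that makes sense for every kind of date list ...
n_dates <- function(x) UseMethod("n_dates")
n_dates.date_list <- function(x) nrow(x)

# ... and one that is only defined for radiocarbon dates
calibrate <- function(x, ...) UseMethod("calibrate")
calibrate.default <- function(x, ...) {
  stop("calibration is only defined for c14_date_list objects")
}
calibrate.c14_date_list <- function(x, ...) {
  # placeholder: real calibration logic would go here
  x
}

osl <- as.osl_date_list(data.frame(osl_age = c(12000, 15000)))
n_dates(osl)                # dispatches via the date_list superclass
inherits(osl, "date_list")  # TRUE
```

The superclass then gives merging and filtering functions one common dispatch target, while method-specific operations like calibration stay restricted to the classes where they are meaningful.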

I depend on your input here, @dirkseidensticker and @MartinHinz. There are other possible solutions, like extra packages for each kind of date or -- much less ambitious -- some more columns for the c14_date_list.

Generally I'm a big fan of the Unix philosophy (Do One Thing and Do It Well), but in this case c14bazAAR already contains multiple functions that are useful for all kinds of dating information. On the other hand, for example dendro dates are a huge can of worms that we might want to avoid.

MartinHinz commented 4 years ago

From my perspective, it actually does not make so much sense to put OSL and Dendro into a package that is called c14bazaar. That would probably go much better into an oslbazaar and dendrobazaar package. Also, one could envisage extracting the filter functions (that might be useful for all the bazaars) into another package bazaarSanitizaar?

nevrome commented 4 years ago

I generally agree with you. My experience with neolithicRC was that somehow nobody considered it for Bronze Age dates. Weird :grin:.

On the other hand it's tedious to maintain and establish multiple packages. I would only invest this time if we have a solid number of expected users.

MartinHinz commented 4 years ago

Admittedly, multiple packages require a bit more coordination. But on the other hand, the amount of code to maintain should stay approximately the same, shouldn't it?

nevrome commented 4 years ago

Exactly that's why I wanted to avoid the overhead of multiple packages. But you convinced me. We could call it the bazAARverse, with the packages

  • c14bazAAR
  • oslbazaAAR
  • urathorbazAAR
  • isotopebazAAR
  • adnabazAAR
  • ...

and the general helper package bazAAR.

Now we only need a team of 5 and 3 weeks of time :+1:. If only we could justify this investment...

MartinHinz commented 4 years ago

Can I detect a hint of sarcasm in those lines? ;-)

nevrome commented 4 years ago

Maybe a pinch. But seriously: this would be fantastic. A great standardization challenge that could advance the field. Some discussions I recently had with @stschiff inspired me to think in bigger dimensions again.

I only fear we're doing this mostly for ourselves at the moment. Is there a possibility that we reach the critical mass to make this an established tool in our field? Some developments in the last weeks give me hope, but I'm still wondering.

MartinHinz commented 4 years ago

This sounds mysterious. Nevertheless, starting with OSL and dendro would already be a (maybe manageable?) enterprise and a 'leap forward'. Although aDNA is also very tempting... Maybe some other parties might be interested in joining, if you agree? I will not name them just now, but I could think of some UK- or Iberia-based scientists who would also benefit from the interface and might invest some time to align it with existing or future packages. What do you think?

nevrome commented 4 years ago

I started to work on c14bazAAR because I profited from it for my own research projects (although the work went way beyond that at some points). I think this connection is crucial to do this in a feasible way.

@yesdavid Do you think you would need OSL (and/or Uranium-Thorium etc.) dates for your research? If yes, could you imagine creating and maintaining such packages if you receive proper support from us? Maybe this is interesting for @felixriede as well?

@MartinHinz Are you in a position where this would apply to you concerning dendro dates? Are there even open (!) databases out there for this kind of data?

I could volunteer to coordinate the process and apply the necessary changes to c14bazAAR to disentangle 14C-related functions from general ones. I'm sure @dirkseidensticker would be on board as well.

Is this a good way to approach this? A good investment of our time? I see this as a long-term, slow-pace transformation.

felixriede commented 4 years ago

We can look into it, and perhaps experiment with data import from existing databases for the Palaeolithic as part of @yesdavid's project. We'll put it on our to-do list.

MartinHinz commented 4 years ago

I will volunteer for the dendro part! Count me in!

dirkseidensticker commented 4 years ago

What a great discussion! I see great value in the way we approached standardization within c14bazAAR, which could be translated to other kinds of data as well.

    Exactly that's why I wanted to avoid the overhead of multiple packages. But you convinced me. We could call it the bazAARverse with the packages

      • c14bazAAR
      • oslbazaAAR
      • urathorbazAAR
      • isotopebazAAR
      • adnabazAAR
      • ...

    and the general helper package bazAAR.

    Now we only need a team of 5 and 3 weeks of time 👍. If only we could justify this investment...

I am very thrilled about such an approach! We would need to discuss how the logic we already have should/could be split. A lot of our efforts with regard to standardization were aimed at the metadata associated with 14C dates. Especially our approaches towards 'thesaurification' are only scratching the surface as of yet.

@nevrome is right, a critical mass is important as well as a focus on research questions that benefit from such 'investments' of time and energy.

Two action points from me:

  1. Should we introduce this as a possible topic for the hackathon at CAA?
  2. How many datasets are out there? If most data are only published behind paywalls as supplementary material to papers, we would not get very far.


    Btw: aDRAC contains a few OSL dates as well ... might be a good time to turn myself in 😉

nevrome commented 4 years ago

    Especially our approaches towards 'thesaurification' are only scratching the surface as of yet.

One interesting aspect @yesdavid brought up: the amount of data we have manually compiled to simplify oddly specific sample material descriptions could be enough to try machine learning. I guess he was joking, but I would love to give this a try one day.

    Should we introduce this as a possible topic for the hackathon at CAA?

I got the impression the topic for the hackathon is already pretty much fixed. But as this will probably not be the last of these events, I think this is a good idea.

    How many datasets are out there? If most data are only published behind paywalls as supplementary to papers we would not get very far.

I think this is the kind of data where it might be easiest to contact the authors and ask for a data publication in a long-term archive.

    Btw: aDRAC contains a few OSL dates as well ... might be a good time to turn myself in :wink:

Off with his head! But seriously: We should check all databases for this: #92

MartinHinz commented 4 years ago

Machine learning sounds good! The only downside is that it should be consistent for every user, and you cannot trust the machines to do so on the client side. But on the 'server side' this might be worthwhile.

Hackathon: not at CAA, but what about a virtual hackathon, or let's call it a sprint on the GitHub repo, one day or another?

Paywall: whoever is not in with open science is out.

nevrome commented 4 years ago

    virtual hackathon

I like this idea. Maybe a day in February that we all try to keep free, to lay the foundation in a concerted effort.

dirkseidensticker commented 4 years ago

I am in!

stschiff commented 4 years ago

This all sounds great. Regarding aDNA data, just my two cents: this data is so high-dimensional (a million genetic markers is not uncommon) that it wouldn't fit into the exact same framework as the other data you have (14C, dendro, isotopes), which mostly comes down to one number (plus extensive meta-information). One could of course think about summary statistics (like principal component coordinates or something).

But I like the generic setup of these bazAARverse packages, which would basically offer a consistent API into such datasets, perhaps even with cross-compatibility of at least the overlapping meta-info fields (say, longitude and latitude, or even somehow universal individual IDs that would link experimental data for the same individual burial).
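A rough sketch of that glue, with entirely made-up example columns and values: two date lists from different (hypothetical) bazAARverse packages share metadata fields, and those shared fields are all that's needed to bring the records together.

```r
# Hypothetical example: two date lists carry method-specific
# measurement columns but share metadata columns (site, lat, lon).
c14 <- data.frame(site = "Site_A", lat = 51.5, lon = 7.4,
                  c14age = 4500, c14std = 30)
osl <- data.frame(site = "Site_A", lat = 51.5, lon = 7.4,
                  osl_age = 12000, osl_std = 800)

# merging on the overlapping meta-info fields puts both kinds of
# measurements side by side; all = TRUE keeps unmatched rows too
merged <- merge(c14, osl, by = c("site", "lat", "lon"), all = TRUE)
names(merged)
# "site" "lat" "lon" "c14age" "c14std" "osl_age" "osl_std"
```

A shared individual or sample ID column would make this linkage far more robust than coordinates alone, which is exactly why a consistent metadata standard across the packages would matter.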

MartinHinz commented 4 years ago

Thanks for the clarification, I think you are absolutely right. On the other hand, things like haplogroups (mt or Y) could be made accessible this way, or just a link for downloading the original high-dimensional data. I am still not very familiar with that topic, but eager to learn, and I would already benefit from such a possibility.

stschiff commented 4 years ago

Yes, good point. Haplogroups could go into this, for sure, and they are already quite interesting. And of course, if we could even automatically download the full data somehow through a function, that would make people's lives a lot easier. I think a lot of this depends on the development of an open and consistent data format for aDNA data and its metadata, and we're working on that with @nevrome and others. So he's in the right position to help push this on the frontend side once we're making progress on the backend side.