tilde-lab / awesome-materials-informatics

Curated list of known efforts in materials informatics, i.e. in modern materials science

define machine-readable? #2

Closed ltalirz closed 6 years ago

ltalirz commented 6 years ago

Since you are restricting the list to machine-readable datasets (and rightfully so, I would say), it would be very helpful to explain what this means, perhaps best using a few examples.

In practical terms: many of these materials science efforts provide an HTML form, which connects to a database and returns another HTML page with search results (possibly paginated). Should this count as machine-readable? In principle, of course, all information made available in digital form can be considered machine-readable, but then we might as well drop the requirement in the first place.
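To make the distinction concrete, here is a minimal sketch of what a client has to do in each case. The two payloads, the field names and the values are made up for illustration and do not come from any database in the list. The HTML route needs a hand-written parser that depends on the page layout; the structured route is a single call.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads with made-up values, for illustration only.
HTML_PAGE = """
<table>
  <tr><th>Formula</th><th>Band gap (eV)</th></tr>
  <tr><td>AB2</td><td>1.5</td></tr>
</table>
"""
JSON_RESPONSE = '[{"formula": "AB2", "band_gap_ev": 1.5}]'


class CellCollector(HTMLParser):
    """Collect the text of <td> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


def records_from_html(page):
    """Scrape records from the results page; breaks if the layout changes."""
    parser = CellCollector()
    parser.feed(page)
    # The header row contains only <th> cells, so it stays empty and is skipped.
    return [
        {"formula": row[0], "band_gap_ev": float(row[1])}
        for row in parser.rows
        if len(row) == 2
    ]


def records_from_json(body):
    """Read records from a structured response; no layout knowledge needed."""
    return json.loads(body)
```

Both functions return the same records here, but only the second one survives a redesign of the web page, which is roughly where I would draw the line.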

In my view:

What did you have in mind?

In the end, perhaps it is best to drop the requirement and instead put something like a FAIR sticker (or similar) on those entries that actually make it easy to query the data automatically.

blokhin commented 6 years ago

Totally agree and support your point of view. I'd add the following extra point, though:

blokhin commented 6 years ago

Well, but that's basically your first point. The only difference is in the public statement.

ltalirz commented 6 years ago

Well, even if a dataset is proprietary, this does not prevent one from implementing an (access-restricted) API. And even if such an API is not present, if the whole database can be downloaded, that's fine from my point of view.

How should we proceed? Should I make a pull request? Perhaps I would rename "contributing" to "guidelines" and include a section there describing the "machine-readable" part.

And would you like to keep "machine-readable" as a basic requirement or would you rather provide a "machine-friendly" sticker that highlights those entries which make an effort to be machine-readable?

blokhin commented 6 years ago

Let's keep the machine-readable criterion as a basic requirement? I think it is crucial. On top of that, to my knowledge, all those mentioned datasets are (or were) investigated with the data science methods.

ltalirz commented 6 years ago

> Let's keep the machine-readable criterion as a basic requirement

Fine!

> to my knowledge, all those mentioned datasets are (or were) investigated with the data science methods

Here it is not really clear to me what this means...

Some of the databases in the list can be downloaded, so that's fine. Some may have documented APIs for automated querying. But several don't, or am I missing something? What about Zeolite Structures Database, WURM, phonon database, NREL, ...? I guess you can reverse-engineer the web forms quite easily, but where does one draw the line?

In essence, what I am looking for is the set of criteria that led you to the choice of the databases in the list (so that I know how to add to it).

blokhin commented 6 years ago

OK, let me try to formulate...

blokhin commented 6 years ago

@ltalirz I thought about your suggestion and ended up with the following. Any database is machine-readable by design. Only the access policies matter (and they aren't necessarily FAIR!). For instance, upon a private agreement, one may be granted an unrestricted access to a conservative, otherwise HTML-only data source.

After contacting the maintainers of some of the uncertain entries on my list, I received explicit or implicit requests for deletion. So why shouldn't we follow the canary principle? We just include anything we know was or would be of use for the mentioned or similar software frameworks and delete immediately by request.

ltalirz commented 6 years ago

> Any database is machine-readable by design. Only the access policies matter (and they aren't necessarily FAIR!).

Agreed.

> For instance, upon a private agreement, one may be granted an unrestricted access to a conservative, otherwise HTML-only data source.

> We just include anything we know was or would be of use for the mentioned or similar software frameworks and delete immediately by request.

Do I understand correctly that you are proposing to include any potentially useful database, as long as its maintainers do not explicitly state (publicly or to us) that it is not open for machine-based data mining? I think this is a reasonable approach.

In this case, however, I would suggest two things:

  1. Define a set of symbols (can even be just words for the moment) that identify, for each entry of the list, its data-mining openness (free / commercial / unknown).
  2. Somewhere (doesn't need to be on the main page), keep the list of databases that have explicitly been excluded (new proposals will be checked against this list).
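
The two suggestions above could be sketched roughly as follows. This is only an illustration of the idea, not a proposal for how the list should be stored: the class and function names and the example database names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Openness(Enum):
    """Data-mining openness labels, as suggested in point 1."""
    FREE = "free"
    COMMERCIAL = "commercial"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Entry:
    """One entry of the list; the label defaults to 'unknown'."""
    name: str
    openness: Openness = Openness.UNKNOWN


def can_add(name, excluded):
    """Point 2: a proposal is acceptable only if it is not on the exclusion list.

    Names are compared case-insensitively and ignoring surrounding whitespace.
    """
    normalized = name.strip().lower()
    return normalized not in {item.strip().lower() for item in excluded}


# Hypothetical names, for illustration only.
excluded = ["Example Excluded DB"]
entries = [
    Entry("Example Open DB", Openness.FREE),
    Entry("Example Unlabelled DB"),
]
```

With this, `can_add("Some New DB", excluded)` is true, while a proposal for `"example excluded db"` is rejected regardless of capitalisation.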

blokhin commented 6 years ago

Great!

> Define a set of symbols (can even be just words for the moment) that identify for each entry of the list its data-mining openness (free / commercial / unknown)

There's a proprietary label already. Its absence implies that the data are open.

> somewhere (doesn't need to be on the main page) keep the list of databases that have explicitly been excluded (new proposals will be checked against this list)

OK, makes sense.