Closed by ltalirz 6 years ago
I totally agree and support your point of view, although I'd add one extra point:
Well, but that's basically your first point. The only difference is in the public statement.
Well, even if a dataset is proprietary, this does not prevent one from implementing an (access-restricted) API. But even if such an API is not present, if the whole database can be downloaded, that's fine from my point of view.
How should we proceed? Should I make a pull request? Perhaps I would rename "contributing" to "guidelines" and include a section there describing the "machine-readable" part.
And would you like to keep "machine-readable" as a basic requirement or would you rather provide a "machine-friendly" sticker that highlights those entries which make an effort to be machine-readable?
Let's keep the machine-readable criterion as a basic requirement? I think it is crucial. On top of that, to my knowledge, all those mentioned datasets are (or were) investigated with the data science methods.
> Let's keep the machine-readable criterion as a basic requirement
Fine!
> to my knowledge, all those mentioned datasets are (or were) investigated with the data science methods
Here it is not really clear to me what this means...
Some of the databases in the list can be downloaded, so that's fine. Some may have documented APIs for automated querying. But several offer neither, or am I missing something? What about the Zeolite Structures Database, WURM, the phonon database, NREL, ...? I guess you can reverse-engineer the web forms quite easily, but where does one draw the line?
In essence, what I am looking for is the set of criteria that led you to the choice of the databases in the list (so that I know how to add to it).
OK, let me try to formulate...
@ltalirz I thought about your suggestion and ended up with the following. Any database is machine-readable by design. Only the access policies matter (and they aren't necessarily FAIR!). For instance, upon a private agreement, one may be granted unrestricted access to a conservative, otherwise HTML-only data source.
After contacting some of the uncertain participants of my list, I received explicit or implicit requests for deletion. So why shouldn't we follow the canary principle? We just include anything we know was or would be of use for the mentioned or similar software frameworks and delete immediately upon request.
> Any database is machine-readable by design. Only the access policies matter (and they aren't necessarily FAIR!).
Agreed.
> For instance, upon a private agreement, one may be granted unrestricted access to a conservative, otherwise HTML-only data source.
> We just include anything we know was or would be of use for the mentioned or similar software frameworks and delete immediately upon request.
Do I understand correctly that you are proposing to include any potentially useful database, as long as they do not explicitly state (publicly or to us) that they are not open for machine-based data mining? I think this is a reasonable approach.
In this case, however, I would suggest two things:
Great!
> Define a set of symbols (can even be just words for the moment) that identify for each entry of the list its data-mining openness (free / commercial / unknown)
There's a proprietary label already. Its absence implies that the data are open.
> Somewhere (doesn't need to be on the main page) keep the list of databases that have explicitly been excluded (new proposals will be checked against this list)
OK, makes sense.
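The two suggestions above (an openness label per entry and a separate exclusion list) could be sketched roughly as follows. This is only an illustration: the `Openness` enum, the `Entry` record, the `check_proposal` helper, and the database names are all hypothetical and not part of the list's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Openness(Enum):
    """Hypothetical data-mining openness labels (free / commercial / unknown)."""
    FREE = "free"
    COMMERCIAL = "commercial"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Entry:
    """One entry of the list; the label defaults to 'unknown'."""
    name: str
    openness: Openness = Openness.UNKNOWN


def normalize(name: str) -> str:
    """Compare names case-insensitively, ignoring surrounding whitespace."""
    return name.strip().casefold()


def check_proposal(name: str, excluded: set) -> bool:
    """Return True if the proposed database is NOT on the exclusion list."""
    return normalize(name) not in {normalize(e) for e in excluded}


# Placeholder names, not real databases
excluded = {"Example Excluded DB"}
entries = [
    Entry("Example Open DB", Openness.FREE),
    Entry("Example Paid DB", Openness.COMMERCIAL),
    Entry("Example Unclear DB"),
]
```

With such a structure, a new proposal would first be passed through `check_proposal` and only then added with whatever label is known at the time.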
Since you are restricting the list to machine-readable datasets (and rightfully so, I would say), it would be very helpful to explain what this means, perhaps best using a few examples.
In practical terms: Many of these materials science efforts provide an HTML form, which connects to a database and spits out another HTML page with search results (possibly paginated). Should this count as machine-readable? In principle, of course, all information made available in digital form can be considered machine-readable, but then we can drop the requirement in the first place.
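To make the distinction concrete, here is a minimal sketch of what reading such an HTML results page by machine involves. The sample page and its table layout are made up for illustration; a real site would look different, and a parser like this breaks whenever the page layout changes, which is the practical argument for a documented API or a bulk download.

```python
from html.parser import HTMLParser

# Made-up results page, standing in for the output of a search form
SAMPLE_PAGE = """
<table>
  <tr><th>Formula</th><th>Space group</th></tr>
  <tr><td>SiO2</td><td>P3221</td></tr>
  <tr><td>NaCl</td><td>Fm-3m</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_results(page):
    """Turn the results table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(page)
    parser.close()
    header, *records = parser.rows
    return [dict(zip(header, rec)) for rec in records]
```

Calling `parse_results(SAMPLE_PAGE)` yields one dictionary per result row. Pagination, session handling, and form submission would all have to be reverse-engineered on top of this for a real site.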
In my view:
What did you have in mind?
In the end, perhaps it is best to drop the requirement and instead put something like a FAIR sticker (or similar) on those entries that actually make it easy to query the data automatically.