pr-omethe-us / PyKED

Python interface to the ChemKED database format
https://pr-omethe-us.github.io/PyKED/
BSD 3-Clause "New" or "Revised" License

Should we include the title in the reference section? #55

Open bryanwweber opened 7 years ago

bryanwweber commented 7 years ago

Title question raised by Mike Burke's group at Columbia.

bryanwweber commented 7 years ago

My thought is that we shouldn't, for two reasons:

1) It doesn't add anything that we don't already have from the DOI.
2) Validating that the title is correct is likely to be error-prone because of encoding issues, with a bunch of edge cases around conversion of non-ASCII characters in the response from the DOI server.

@kyleniemeyer any thoughts?

kyleniemeyer commented 7 years ago

Yes, I agree. I don't see too much benefit to including it.

Though, this might lead to the question of whether we need anything beyond DOI...

Also, what if the reference doesn't have a DOI? In that case it might be advised to add a title field, but none of the reference info will be checked anyway.

bryanwweber commented 7 years ago

> Though, this might lead to the question of whether we need anything beyond DOI...

I think we need a full set of reference information, like what would be published in a journal. Some (most?) journals don't include the title of the article in the references section. Also, including the authors field for the reference feels like giving credit where it's due.

If the reference doesn't have a DOI... like for a report or something, maybe we should have a URL field? If the data isn't publicly available somehow, I don't think we should include it in the database at all. In either case, it still feels like the title field is redundant.

kyleniemeyer commented 7 years ago

I agree that we should probably only accept files when the reference is publicly available somewhere—I don't want to exclude conference papers that don't get turned into journal papers, though.

Thinking about this is leading to a chicken-and-egg problem in my head: ideally we want people (including us, or you at least) to create ChemKED files when they put a paper together, and perhaps include that as supplementary material with the submission. In that case, what do they put in the reference block? Just authors and a note about being under review? Perhaps the file-version should be 1.0alpha or something?

bryanwweber commented 7 years ago

> I agree that we should probably only accept files when the reference is publicly available somewhere; I don't want to exclude conference papers that don't get turned into journal papers, though.

Does this include papers presented at, e.g., the US National Combustion Meetings, where the proceedings aren't published online? I'm inclined not to allow submissions of data from such meetings, because there's no way for someone who didn't attend to verify the data, and the data hasn't been peer-reviewed, which, for all its faults, is still the minimum standard of acceptability.

I'm working on some files now for a paper; I'm putting the journal, year, and authors. Once it's in press, I'll add the DOI and submit it to the database. I'm not sure if I'll put the files in the supplementary material... If I do, I'll leave out the DOI (because I won't know it, I don't think), and I'll bump the file-version to 1 when I add the DOI and submit to the database. Then I'll bump it to 2 when I get a volume/issue/page.

kyleniemeyer commented 7 years ago

I think that if it came from a conference paper, at minimum the conference paper would need to be available on (e.g.) Figshare or something. I agree that we should prefer peer-reviewed data, but I also don't want to 100% exclude something potentially useful that didn't get published for some reason... not sure.

> I'm working on some files now for a paper; I'm putting the journal, year, and authors. Once it's in press, I'll add the DOI and submit it to the database. I'm not sure if I'll put the files in the supplementary material... If I do, I'll leave out the DOI (because I won't know it, I don't think), and I'll bump the file-version to 1 when I add the DOI and submit to the database. Then I'll bump it to 2 when I get a volume/issue/page.

I definitely think we should encourage people to include the files as supplementary material, so that they are attached to the source paper. Not sure if you will have the DOI when it comes time to upload final materials for the paper, though.

bryanwweber commented 7 years ago

OK, perhaps the criterion is that it has to have a permanent identifier of some sort. But this discussion has gotten way off track (sorry, I got us off track :smiley:), and we should probably move the bits about the acceptability of data (or not) over to the ChemKED-database repo (and also write a wiki entry there on how to submit new data).

I think we agreed that title is not worth adding to the schema. If that's correct, feel free to close the issue (I just wanted to document the discussion for future reference).

kyleniemeyer commented 7 years ago

Yes, I agree we don't need to add it.

bryanwweber commented 7 years ago

From Mike Burke via email to Bryan:

With regard to the title, the value I see for having a title is that I can recognize what dataset it is by simply looking at the title rather than having to look up the paper based on the DOI. Could it simply be an optional item to specify? In my view, if one already specifies file authors, journal, etc., there seems little reason why a title would not be included.

bryanwweber commented 7 years ago

That's a reasonable use case. My concern is that validating that the title is correct (by comparing with the value from a DOI lookup) is bound to have many edge cases - for instance, some journals use HTML in their titles in the DOI service, while others don't. Having to code for all of these cases seems like it will lead to many false warnings.
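To make those edge cases concrete, here is a minimal sketch of a loose title comparison that tolerates inline HTML, HTML entities, and Unicode compatibility characters. The helper names `normalize_title` and `titles_match` are hypothetical and not part of PyKED, and the looked-up title is assumed to be passed in as a plain string rather than fetched from a DOI service.

```python
import html
import re
import unicodedata

# Matches simple inline tags such as <i>, </i>, or <sub>; a bare "<" followed
# by a space (as in "x < 5") is left alone.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def normalize_title(title: str) -> str:
    """Reduce a title to a form suitable for a loose comparison."""
    text = _TAG_RE.sub("", title)                # drop inline markup
    text = html.unescape(text)                   # e.g. "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = " ".join(text.split())                # collapse runs of whitespace
    return text.casefold()                       # ignore case differences


def titles_match(yaml_title: str, doi_title: str) -> bool:
    """Return True when the two titles agree after normalization."""
    return normalize_title(yaml_title) == normalize_title(doi_title)
```

Even with this much normalization, punctuation and spelled-out chemical names can still differ, so a mismatch here is probably better treated as a warning than as an error.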

The reason I'm insisting that we validate the title is that we are trying, to the best of our ability, to check that everything specified in the data file is correct according to some external standard. For instance, we also check the ORCID values for authors, if provided, to ensure the spelling of their names is correct, and we check the volume, issue, year, journal, and authors from a DOI lookup.
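As a rough illustration of that kind of check, the sketch below compares the simple reference fields against a DOI lookup that is assumed to have already been reduced to a flat dictionary. The function name `find_mismatches`, the `CHECKED_FIELDS` tuple, and the dictionary layout are all hypothetical; PyKED's actual validation code may be organized quite differently, and author comparison is left out.

```python
from typing import Any, Dict, List

# Scalar fields mentioned above; authors would need their own comparison.
CHECKED_FIELDS = ("journal", "year", "volume", "issue")


def find_mismatches(reference: Dict[str, Any], lookup: Dict[str, Any]) -> List[str]:
    """Return one message per field where the file and the lookup disagree."""
    problems = []
    for field in CHECKED_FIELDS:
        if field not in reference:
            continue  # not specified in the file, so there is nothing to check
        expected = lookup.get(field)
        if expected is None:
            continue  # the lookup has no value to compare against
        # Compare as strings so that 2017 and "2017" count as equal.
        if str(reference[field]).strip() != str(expected).strip():
            problems.append(
                f"{field}: file has {reference[field]!r}, DOI lookup has {expected!r}"
            )
    return problems
```

A title field would slot into the same loop only if a strict string comparison were good enough, which is exactly what is in doubt here.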

I'll look into testing this, picking say 100 random DOIs and seeing how accurate a relatively simple comparison will be. Reopening so I don't forget to do this.

bryanwweber commented 6 years ago

OK, as I suspected, there are a number of differences in title formatting and such. However, it's not that difficult to print out a useful diff between the returned title and the title from the YAML, so I think this is workable. We might need to wait until #78 is resolved so that the diff can be shown to the user in a useful way.
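For reference, a word-level diff along these lines can be sketched with the standard library's `difflib`. The function name `title_diff` is hypothetical, and how the result would actually be surfaced to the user is not settled here.

```python
import difflib


def title_diff(yaml_title: str, doi_title: str) -> str:
    """Return a word-level diff, or an empty string when the titles agree."""
    if yaml_title == doi_title:
        return ""
    diff = difflib.ndiff(yaml_title.split(), doi_title.split())
    # Lines starting with "?" are intra-word hints from ndiff; drop them to
    # keep the output short.
    return "\n".join(line for line in diff if not line.startswith("?"))


if __name__ == "__main__":
    print(title_diff("Ignition of butanol", "Ignition of n-butanol"))
```

Words prefixed with `- ` come from the YAML title and words prefixed with `+ ` come from the DOI lookup, so the user can see at a glance which part disagrees.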