research-software-directory / RSD-as-a-service

This repo contains the new RSD-as-a-service implementation
https://research.software

Enable editing of the type of automatically scraped project impact/output #1292

Closed elboyran closed 1 month ago

elboyran commented 2 months ago

The automatic scraping showing the impact/output of a project/software is great (good work as a result of issue #1076)!

However, sometimes the assigned type is not correct.

Example:

A conference paper mentioning dianna (both project and software) has been automatically discovered and added as a book section, which it isn't. I did not notice this addition and manually added the conference paper in Output for the project.

It would be nice to allow editing of the type (and maybe more fields?) of automatically scraped and added items (for impact/output, but maybe also in general?).

ewan-escience commented 2 months ago

So this is about https://doi.org/10.1007%2F978-3-031-63787-2_16, right? The metadata we use is from the Crossref API. There, its type is book-chapter, which we translate to bookSection internally.
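As a rough illustration of the translation step described above (a sketch only, not the RSD's actual harvester code): the Crossref REST API returns a work record whose `message.type` field holds the type, and a lookup table maps it to an internal name. Only the `book-chapter` to `bookSection` pair comes from this thread; the other table entries, the function names, and the `"other"` fallback are assumptions made for the example.

```python
from urllib.parse import quote

# Only "book-chapter" -> "bookSection" is taken from the thread.
# The remaining entries are hypothetical placeholders.
CROSSREF_TO_INTERNAL = {
    "book-chapter": "bookSection",
    "proceedings-article": "conferencePaper",  # assumed internal name
    "journal-article": "journalArticle",       # assumed internal name
}


def crossref_work_url(doi: str) -> str:
    """Build the Crossref REST API URL for a DOI (the DOI is percent-encoded)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="")


def internal_type(work_response: dict, default: str = "other") -> str:
    """Translate the Crossref type of a parsed work response to an internal type."""
    crossref_type = work_response.get("message", {}).get("type")
    return CROSSREF_TO_INTERNAL.get(crossref_type, default)


# A trimmed-down stand-in for a parsed API response; no network call is made.
sample_response = {"status": "ok", "message": {"type": "book-chapter"}}

print(crossref_work_url("10.1007/978-3-031-63787-2_16"))
print(internal_type(sample_response))
```

Because the type is looked up from the single `type` field, whatever the publisher registered at Crossref is what ends up on the page.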

I'm not a fan of overriding data from external sources like this. Why would a user know better than the source? What other fields should we make overridable then? It would affect every RSD page that has this as a mention.

I think a better approach would be to request the metadata to be changed at the source. Looking at this specific example, though, the website also talks about "Access this chapter" and "Buy Chapter" and refers to its parent as a "book".

elboyran commented 2 months ago

Hi @ewan-escience and @jmaassen, thanks for looking into my issue.

If you have noticed, just above and under the title of the paper it says

Conference paper

and

Cite this conference paper,

because that is what it is. I do not know why the Crossref API's type is book-chapter. The "book" here is the conference proceedings.

My 2 cents:

conference paper,

but as I said, I do not know in which metadata type that information resides.

My solution was to enter it all manually, but that is not efficient and prone to duplication as I described in the issue.

elboyran commented 2 months ago

Actually, I think I can guess what "chapter" means here: the whole quarter of the proceedings (as you can see in the picture of the proceedings' front page, it says Part I)! The organizers shared the proceedings with us as 4 PDFs.

But I want to refer to only this paper, not the whole chapter!

ewan-escience commented 2 months ago

maybe another metadata type reflects the true type?

No, type is the only metadata field reflecting this, as far as I know. You can see all existing types here. (And the link to the Crossref API is now corrected in my previous post, where you can see all the metadata that Crossref has of this work.)

as to requesting changing the source, who should be doing this and to whom?

The one who wants it to be changed, which would be you in this case. 🙂 I would start with contacting SpringerLink and see if they can further help you.

Manually changing the wrong type seems much less effort than finding out and contacting external people

Allowing people to change harvested metadata is bound to give more trouble in the long run (the rest of the RSD team agrees with this). It's also not a trivial change: since we re-harvest mentions often, we would need to keep track of which fields were manually overwritten. And it would change entries like this on all RSD pages where they are present, where others might not agree with the manual changes.
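To make the bookkeeping concern concrete, here is a minimal sketch of what tracking manual overrides across re-harvests could look like. It is purely hypothetical: the `Mention` class, its field names, and the `conferencePaper` value are invented for this example and do not describe how the RSD stores mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    """Hypothetical mention record that remembers which fields were edited by hand."""
    doi: str
    fields: dict
    overridden: set = field(default_factory=set)

    def override(self, name: str, value: str) -> None:
        """Apply a manual edit and remember that this field must not be re-harvested."""
        self.fields[name] = value
        self.overridden.add(name)

    def apply_harvest(self, harvested: dict) -> None:
        """Update from freshly harvested metadata, skipping manually overridden fields."""
        for name, value in harvested.items():
            if name not in self.overridden:
                self.fields[name] = value


mention = Mention(
    doi="10.1007/978-3-031-63787-2_16",
    fields={"type": "bookSection", "title": "Old title"},
)
mention.override("type", "conferencePaper")  # assumed internal type name

# A later re-harvest brings the source value back, plus an updated title.
mention.apply_harvest({"type": "bookSection", "title": "New title"})
print(mention.fields)
```

Even in this small form, every mention needs extra state, and because a mention is shared, the override would still show up on every page that references it.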

I'm not sure what you mean with your latest post. The DOI I posted before links to a page where I can buy one PDF, so I'm not sure where you encountered 4 PDFs.

elboyran commented 2 months ago

@ewan-escience, @jmaassen I am saying that the type you are using is for 1 "chapter" of the proceedings (which consist of a total of 4 "chapters", i.e. PDF files!), where the paper resides, not for the paper itself. The link does point to a site stating "conference paper", right? So this should be reflected somewhere in the metadata and retrieved correctly by your scraping.

From the same DOI following

Cite this conference paper

I have downloaded the BIB reference and imported it in Zotero. There the Item Type appears as Conference Paper.

Can you please add my GitHub ID in your potential response? Otherwise I'm not notified.

ewan-escience commented 2 months ago

@elboyran I agree with you that it's actually a conference paper. However, I don't see this reflected anywhere in the metadata. If you see it in the metadata, please let us know, so that we maybe can incorporate it in our harvesters. Otherwise, you'd really have to contact the publisher to get the metadata changed.

elboyran commented 2 months ago

@ewan-escience As I wrote above, I saw it in the metadata field Item Type of the citation files (e.g. the BIB one) accessible from the DOI link. I have no idea how to get there programmatically, though...

ewan-escience commented 2 months ago

@elboyran we only use the metadata from Crossref, in this case from here. We are not going to be harvesting and parsing BibTeX or other files.