inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

DESY FTP #73

Open kaplun opened 7 years ago

kaplun commented 7 years ago

During the INSPIRE Week it was agreed that DESY would make available through FTP the different feeds that are then loaded into INSPIRE.

I'd propose that the FTP be divided into one directory per feed.

@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?

ksachs commented 7 years ago

Sorry - misunderstanding.

For Elsevier, World Scientific, APS, PoS the publisher data are currently harvested at CERN. I don't know whether on legacy or labs. After the conversion CERN deposits INSPIRE-xml on the DESY FTP server and sends an email to desydoc@mail.desy.de. We need the DESY FTP server only as long as we do the matching/selection/merging via the DESY workflow.

Springer serves their data on their own FTP server (ftp.springer-dds.com); there is no need to copy it to DESY once the harvesting is done at CERN.

PTEP and Acta Physica Polonica B send emails with attachments. Is there a possibility at CERN to feed email attachments to a HEPcrawl spider?

Other emails are only alerts to trigger a web-crawl program. Again it would be nice if an email could trigger a HEPcrawl spider. For now we just process these journals at DESY. We don't have HEPcrawl spiders for those anyhow.

kaplun commented 7 years ago

I think the easiest thing would be for you to indeed store those attachments in a shared space such as the mentioned DESY FTP server.

For the triggers... Mmh... So, hepcrawl does have an interface to trigger a crawl; @david-caro might provide more information about it. Basically, you could send an HTTP POST request to hepcrawl to trigger the harvesting of the corresponding journal.

kaplun commented 7 years ago

http://pythonhosted.org/hepcrawl/operations.html#schedule-crawls
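
For illustration, a minimal sketch of such a trigger, assuming hepcrawl sits behind a standard scrapyd instance as described in the operations docs (host, port, project name and spider name below are illustrative assumptions):

import requests

# Trigger a crawl via scrapyd's schedule endpoint; the project and
# spider names here are illustrative assumptions.
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={
        "project": "hepcrawl",  # scrapyd project name
        "spider": "desy",       # e.g. one spider per feed/journal
    },
)
response.raise_for_status()
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}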

david-caro commented 7 years ago

Last week we agreed to create a simple interface to allow hepcrawl to harvest MARCXML records from DESY; that way we are not rushed by the legacy shutdown into implementing any DESY-side flows, and that work can be done calmly, bit by bit.

So, in order to bootstrap that conversation, I propose to add a folder on the DESY FTP server with the records to harvest, and hepcrawl will pick them up periodically.

The records should be separated into subfolders by source, so that hepcrawl knows where they originally come from (Springer, Elsevier...).
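
As a hypothetical sketch of how the spider could infer the source from that layout (the root path and folder names are assumptions):

import os

FTP_ROOT = "/ftp/inspire/desy"  # assumed mount point of the DESY FTP share

def iter_records(root=FTP_ROOT):
    """Yield (source, path) pairs, one per record file under root."""
    for source in sorted(os.listdir(root)):  # e.g. "springer", "elsevier"
        source_dir = os.path.join(root, source)
        if not os.path.isdir(source_dir):
            continue
        for name in sorted(os.listdir(source_dir)):
            if name.endswith(".xml"):
                yield source, os.path.join(source_dir, name)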

What do you think?

ksachs commented 7 years ago

Creating a subfolder on the DESY FTP server where CERN can pick up marcxml to feed hepcrawl is a very good idea.

But why does hepcrawl need to know where they came from? It is converted INSPIRE marcxml. Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary. E.g. for the abstract we add the source (=publisher) to 520__9 anyhow.

david-caro commented 7 years ago

It needs the source of the crawl for various reasons.

But yes, having it in the metadata somehow might be enough. I just proposed the directory structure for easy organization and implementation: 50 dirs is not that many, and it easily allows seeing whether any provider source is empty or not being crawled properly, whereas adding it to the metadata only means having to check the contents of the files every time you want to know something similar.

The key point being, we need a stable and reliable way of knowing the origin of the record.


ksachs commented 7 years ago

The origin of the record is 'DESY'.

1) For display, the journal might be more useful; fall back to 'DESY', or to the publisher if it is in the metadata.
2) Matching: only relevant when the data come directly from the publisher, e.g. the Springer crawler.
3) For tracking purposes the source is DESY; the rest is our (= DESY-local) problem, including the question whether a publisher got 'stuck'.

This workflow via DESY can be a short-term solution for the bigger publishers. Only for the small and infrequent publishers will we need it for a longer period, and there it doesn't help to know that a folder is still empty, since that might be correct. Florian and I would suggest leaving the responsibility for whether the harvest/conversion went fine with DESY, and just processing what is in the metadata.

kaplun commented 7 years ago

Ideally it would be great to have the real source (i.e. the name of the publisher), so that later, when a crawler is ported from DESY to INSPIRE, it is possible to compare apples with apples. As you might remember, in order to implement the automatic merging of a record update we need to fetch the last version for the corresponding source of the record that is being manipulated. If all the sources read DESY, then you need to guarantee that you won't ever have the same publication coming through 2 separate sources that are then masked as DESY when they arrive at INSPIRE.

kaplun commented 7 years ago

But why does hepcrawl need to know where they came from? It is converted INSPIRE marcxml. Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary. E.g. for the abstract we add the source (=publisher) to 520__9 anyhow.

@david-caro I think this should indeed be good enough for hepcrawl to guess the source. After all, a source doesn't need to be associated with one and only one hepcrawl spider.

david-caro commented 7 years ago

Then how do we differentiate the DESY ones from the non-DESY ones?

ksachs commented 7 years ago

Don't mix up source (the way of harvesting) and publisher (metadata).

@kaplun Wrt. source: you don't have that info for 1M records in INSPIRE. For big publishers the DESY-spider workaround is a short(!!!)-term temporary solution; don't make it perfect. For small publishers, that's peanuts; we don't need to compare to a previous version. In any case, it's DESY spider + DOI you can compare to.

@david-caro
desy-spider -> source = DESY, publisher = whatever is in the metadata
other spider -> non-DESY

kaplun commented 7 years ago

@ksachs in inspire-schema we call source the origin of truth, i.e. the publisher. How things reach us is of somewhat lesser importance, and that goes into acquisition_source.

@kaplun Wrt. source: you don't have that info for 1M records in INSPIRE.

Sure, but anyway we should start somewhere, and updates from publishers will most often be about papers that reached us within the last year as preprints. So if we start to have clear data from now onwards, we will reach a steady state within one year (i.e. much less pain for catalogers due to unresolved conflicts caused by missing/untraceable history).

ksachs commented 7 years ago

Maybe we are not talking about the same thing; a video meeting might be helpful. For arXiv: do you want to compare to another arXiv version, or to the update that comes from the publisher? For most preprints we don't get the publisher info from arXiv, and if we do, it can be either the publisher or the journal.

ksachs commented 7 years ago

Is there a show-stopper if you just convert the MARC to JSON, as for existing INSPIRE records, plus acquisition_source = DESY?

david-caro commented 7 years ago

Ok, so in the end, the acquisition_source for records that are harvested by the desy spider will be:

"acquisition_source": {
    "method": "hepcrawl",
    "source": "desy"
}

And the data of the record will be exactly whatever is passed from desy (the output of dojson on the xml).
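
A minimal sketch of that flow, assuming the inspire-dojson API (marcxml2record is the converter entry point in current inspire-dojson; the exact API in use at the time may have differed):

from inspire_dojson import marcxml2record

def desy_xml_to_record(marcxml):
    """Convert DESY MARCXML to a JSON record and stamp the acquisition source."""
    record = marcxml2record(marcxml)
    record["acquisition_source"] = {
        "method": "hepcrawl",
        "source": "desy",
    }
    return record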

Does anyone disagree?

david-caro commented 7 years ago

And, on the topic of the issue: the FTP will just be a folder with individual XML files, one per record, which will be removed upon ingestion. (I recommend moving them to a temporary dir that gets cleaned up periodically, though that should probably be done on the server side if you want it, just in case we want to rerun anything.)
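
A hypothetical sketch of that "move after ingestion" idea (paths are assumptions; the periodic cleanup itself would live in a cron job or similar):

import os
import shutil

INGESTED_DIR = "/ftp/inspire/desy/_ingested"  # assumed holding directory

def mark_ingested(path):
    """Move a harvested file out of the inbox so it is not picked up again,
    while keeping it around in case the harvest needs to be rerun."""
    os.makedirs(INGESTED_DIR, exist_ok=True)
    shutil.move(path, os.path.join(INGESTED_DIR, os.path.basename(path)))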

kaplun commented 7 years ago

I am not sure one XML file per record is the easiest on the DESY side. What about the possibility of grouping multiple records in one MARCXML file? (Normally, multiple MARCXML records are grouped into a <collection> ... </collection>.)
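
For illustration, splitting such a file back into individual records is straightforward with lxml; a minimal sketch (the namespace follows the MARC 21 slim schema, and the file name is an assumption):

from lxml import etree

MARC_NS = "http://www.loc.gov/MARC21/slim"

def iter_marcxml_records(path):
    """Yield each <record> of a MARCXML <collection> as a standalone XML string."""
    for _, record in etree.iterparse(path, tag="{%s}record" % MARC_NS):
        yield etree.tostring(record, encoding="unicode")
        record.clear()  # free the parsed element as we go

for xml in iter_marcxml_records("desy_collection.xml"):
    print(xml[:80])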

fschwenn commented 7 years ago

Right, it would be easier if we could pass on collections of records in a file.

david-caro commented 7 years ago

Hmm, then in order to parse them we would have to iterate over each record in every file... That might be messy on the scrapy side.


fschwenn commented 7 years ago

If needed, we can split the xml also on DESY side - no problem.

david-caro commented 7 years ago

No need, we can do it on our side :), thanks!

Another question: the MARCXML files you provide will have files attached to them, right? If so, what paths will they have (so we can download them)? @ksachs @fschwenn ^

fschwenn commented 7 years ago

The publishers from which we get fulltexts will run via HEPCrawl. For all these smaller publishers, for which we need the DESYmarcxmlSpider, the only fulltexts are OA ones, for which the XML would contain a weblink.

david-caro commented 7 years ago

There will be an overlapping period when some big publishers still run on DESY (Springer, for example), so we should support those too, right?