inspirehep / inspire

Official repo of the legacy INSPIRE-HEP overlay
http://projecthepinspire.net

What's up with FFT%%? #276

Closed jacquerie closed 7 years ago

jacquerie commented 7 years ago

A few records have a bizarre FFT%% field. For example: https://inspirehep.net/record/1094156/export/xme.

Because these FFTs are incomplete, they currently raise a nasty exception in DoJSON (which expects an s subfield code): https://sentry.cern.ch/inspire-sentry/inspire-nightly/group/619250.

What should I do with them? Just ignore them?

jacquerie commented 7 years ago

Forgot to cc @kaplun, but maybe @tsgit knows the answer too.

tsgit commented 7 years ago

woah -- that's a bad bibupload from

1094156.20140106152404 375595 hilu mode:replace file:/opt/cds-invenio/var/tmp-shared/bibedit-cache/bibedit_record_eV_TNm_1094156_32.xml

in 2014. It's not limited to the xme format; it's also publicly visible in the xm format, and those are supposed to be hidden files.

how many are there? @jacquerie @kaplun

tsgit commented 7 years ago

hilu is a DESY cataloguer -- but not a developer, so something in the toolchain must have been wrong at the time

In [7]: from invenio.search_engine import get_collection_reclist

In [8]: heprecs = get_collection_reclist('HEP')

In [9]: badfft = set()

In [10]: for r in heprecs:
    ...:     xm = decompress(run_sql('select value from bibfmt where format="xm" and id_bibrec=%s' % r)[0][0])
    ...:     if xm.find('datafield tag="FFT" ind1="%" ind2="%"') > -1:
    ...:         badfft.add(r)
    ...:         

In [11]: len(badfft)
Out[11]: 139

In [12]: badfft
Out[12]: 
{1090369,
 1094156,
 1115831,
 1115876,
 1116124,
 1119996,
 1120518,
 1123359,
 1123523,
 1123802,
 1124579,
 1179996,
 1184387,
 1185409,
 1186735,
 1189002,
 1191014,
 1192965,
 1198033,
 1201900,
 1202269,
 1202491,
 1203072,
 1203155,
 1203366,
 1203846,
 1203875,
 1204492,
 1204547,
 1204945,
 1206327,
 1206352,
 1206843,
 1206884,
 1207442,
 1207630,
 1207641,
 1207869,
 1208106,
 1208623,
 1208733,
 1208807,
 1208884,
 1209405,
 1209447,
 1209840,
 1209910,
 1210054,
 1210064,
 1210447,
 1210689,
 1210692,
 1211366,
 1215306,
 1215337,
 1215587,
 1215612,
 1215782,
 1216303,
 1216535,
 1216603,
 1216672,
 1216887,
 1217117,
 1217362,
 1217696,
 1217710,
 1217741,
 1217858,
 1217862,
 1217981,
 1218030,
 1218290,
 1218345,
 1218357,
 1218393,
 1218995,
 1219065,
 1219075,
 1219249,
 1219311,
 1219343,
 1219346,
 1219970,
 1220252,
 1220253,
 1221009,
 1221061,
 1221062,
 1221074,
 1222146,
 1222686,
 1222841,
 1223359,
 1223860,
 1223990,
 1224160,
 1225546,
 1226021,
 1227658,
 1230983,
 1236870,
 1239650,
 1268877,
 1268878,
 1268879,
 1268880,
 1268881,
 1268882,
 1268883,
 1268884,
 1268885,
 1268886,
 1268887,
 1268888,
 1268889,
 1268890,
 1268891,
 1268893,
 1268894,
 1268895,
 1268896,
 1268897,
 1268898,
 1268899,
 1268900,
 1268901,
 1268902,
 1268903,
 1269029,
 1269030,
 1269031,
 1269032,
 1269033,
 1269034,
 1269035,
 1269036,
 1269037,
 1269038}
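The scan above matches a raw substring in the stored MARCXML. A slightly more robust variant, sketched here under the assumption that each record is available as a MARCXML string, parses the XML and compares the attributes directly. `has_wildcard_fft` is a hypothetical helper name, not part of Invenio, and the subfield value in the sample is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def has_wildcard_fft(marcxml):
    """Return True if the record holds an FFT datafield with '%' indicators."""
    root = ET.fromstring(marcxml)
    for element in root.iter():
        if (
            local_name(element) == "datafield"
            and element.get("tag") == "FFT"
            and element.get("ind1") == "%"
            and element.get("ind2") == "%"
        ):
            return True
    return False


# Illustrative record; the subfield content is a placeholder, not real data.
sample = """<record>
  <controlfield tag="001">1094156</controlfield>
  <datafield tag="FFT" ind1="%" ind2="%">
    <subfield code="a">hypothetical-value</subfield>
  </datafield>
</record>"""

print(has_wildcard_fft(sample))  # True
```

Parsing avoids false negatives when the attribute order or quoting in the stored XML differs from the literal string being searched for.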

tsgit commented 7 years ago

@fschwenn any comments?

jacquerie commented 7 years ago

> how many are there?

Sentry has 284 events of this kind. Since two full migrations have happened so far, this means 142 records.

kaplun commented 7 years ago

Yeah, poor bibupload treated the % literally, which is IMHO correct when it is provided in the input MARCXML. Had it been well designed, though, it should have raised an invalid-character error.
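The kind of input check described here could look like the following minimal sketch. It is not bibupload's code: `check_indicators` is a hypothetical function, and the set of allowed characters (blank, underscore as a spelled-out blank, digits, lowercase letters) is an assumption made for the example.

```python
import re

# Assumption: an indicator is exactly one character, either a blank, an
# underscore (used by some tools to spell a blank), a digit or a lowercase
# letter. Anything else, such as '%', is rejected.
VALID_INDICATOR = re.compile(r"\A[ _0-9a-z]\Z")


def check_indicators(tag, ind1, ind2):
    """Raise ValueError if either indicator is not an allowed character."""
    for position, value in (("ind1", ind1), ("ind2", ind2)):
        if not VALID_INDICATOR.match(value):
            raise ValueError(
                "invalid character %r in %s of field %s" % (value, position, tag)
            )


check_indicators("FFT", " ", " ")  # accepted, returns silently

try:
    check_indicators("FFT", "%", "%")
except ValueError as error:
    print(error)
```

Rejecting the upload at this point would have surfaced the problem when the record was submitted, instead of leaving a literal FFT%% field in the stored record.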

tsgit commented 7 years ago

so to answer @jacquerie's question about what to do with them: ignore in migration

the affected records should have had publisher XML attached as hidden files, but because the MARCXML was incorrect, bibupload did not attach it. @fschwenn might know if the original source is still available and, if so, could attach the publisher XML correctly and remove the bad FFT%% tags from the records

I counted 139 affected records above, @jacquerie estimates 142 records.
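As a rough illustration of dropping the bad fields before a record is processed further, the sketch below removes every FFT datafield whose indicators are both `%` from a MARCXML string. It assumes Python 3 and the standard library only; `strip_wildcard_fft` is a hypothetical name and this is not the code used in the migration.

```python
import xml.etree.ElementTree as ET


def strip_wildcard_fft(marcxml):
    """Return (cleaned MARCXML string, number of FFT%% datafields removed)."""
    root = ET.fromstring(marcxml)
    removed = 0
    # Snapshot the elements first so removing children does not disturb
    # the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if (
                child.tag.rsplit("}", 1)[-1] == "datafield"
                and child.get("tag") == "FFT"
                and child.get("ind1") == "%"
                and child.get("ind2") == "%"
            ):
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Illustrative record; field contents are placeholders, not real data.
record = """<record>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Hypothetical title</subfield>
  </datafield>
  <datafield tag="FFT" ind1="%" ind2="%">
    <subfield code="a">hypothetical-value</subfield>
  </datafield>
</record>"""

cleaned, count = strip_wildcard_fft(record)
print(count)
```

Returning the count alongside the cleaned XML makes it easy to log how many records were actually affected during a run.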

jacquerie commented 7 years ago

> ignore in migration

:+1:

The small discrepancy in the numbers can be explained by the fact that Nightly currently works on an old prodsync dump from September, so stuff might have changed in the meantime.

fschwenn commented 7 years ago

It seems the APS harvesting code had a bug in its early days. We at DESY did not notice immediately, since the hidden files are not of interest during selection and curation anyway. The original fulltext.xml files at /afs/cern.ch/project/inspire/uploads/aps start at 2014.06.05; I do not know why. As for reharvesting a list of DOIs, I fear the status is still that of a year ago, when Jan answered: "APS has a new API (v2) that should work better than the current one and return metadata directly, see e.g. http://harvest.aps.org/docs/harvest-api#general, but we do not have access yet (I just asked for it). Using the old/current API is a bit more tricky since the metadata (e.g. the abstract) is not so easily available - only as part of the XML in a Bagit archive. I could show you how, but I would recommend to wait and see if we can get access to the new API first."

jacquerie commented 7 years ago

The original issue is fixed by https://github.com/inspirehep/inspire-next/pull/1984. The new issue that @fschwenn brought up should probably go in its own issue in ...hepcrawl?

kaplun commented 7 years ago

It's partly a hepcrawl issue and partly not code-related, i.e. we need to investigate how to obtain access to the API.

tsgit commented 7 years ago

can this be closed?

meanwhile I also removed all the bad MARC FFT entries

jacquerie commented 7 years ago

Yeah, this can be closed.