Open Technologicat opened 8 months ago
Thanks for the issue, I haven't encountered that field before. Does it come from some specific Web of Science index? I am most familiar with the 'core' citation indexes, which don't use this field, as far as I'm aware.
Anyway, my current thinking is that we have to do two things:
Add the missing tags to tags.py. I hadn't done that yet because it is hard to know whether a field is 'splittable' and, if so, whether it puts each item on a separate line (WoS is sadly inconsistent). But the current situation is even less ideal, so I'll just make an educated guess. (This in itself will already help a lot, but since WoS continues to add tags, I think we need to do more to avoid the issue entirely, hence the rest of this list.)
I don't know where exactly it came from. It was in a bunch of datafiles a coworker sent me. We'd like to visualize a bunch of papers via their semantic embeddings (as well as possibly perform some other NLP analysis on titles and abstracts), so I'm building a custom tool to do that. The wosfile library is, obviously, what parses those input files, which are in WOS format.
Exactly one of the files I got has at least one record that has Y1 populated. I'll ask for the details. Come to think of it, I could also print out the offending records to see if there's a pattern.
A warning for unknown fields sounds nice. For example, I don't exactly need the Portuguese document title, so I'd be fine parsing just the known fields correctly.
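To make the "warn and move on" idea concrete, here is a minimal sketch. It does not use wosfile's actual internals: KNOWN_TAGS and keep_known_fields are hypothetical names, and the tag set is just a small illustrative subset of what tags.py would contain.

```python
import warnings

# Hypothetical subset of known tags; the real table lives in wosfile/tags.py.
KNOWN_TAGS = {"PT", "AU", "TI", "AB", "C1", "UT"}


def keep_known_fields(raw_record, known_tags=KNOWN_TAGS):
    """Return only the known, non-empty fields; warn about unknown tags."""
    parsed = {}
    for tag, value in raw_record.items():
        if not value:
            continue  # empty values are skipped, as in the thread
        if tag not in known_tags:
            # Warn instead of raising, so one odd record cannot abort a run.
            warnings.warn(f"Skipping unknown field tag {tag!r}", stacklevel=2)
            continue
        parsed[tag] = value
    return parsed
```

With this approach a record that has "Y1" populated would still yield its title and abstract, and the warning tells you which tag was dropped.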
Ok, investigated. This is a data entry error. In exactly one record of exactly one of my datafiles, the field "Y1" is populated with author affiliations, which should be in the "C1" field.
FWIW, the full list of columns in that file is:
PT AU BA CA GP RI OI BE Z2 TI X1 Y1 Z1 FT PN AE Z3 SO S1 SE BS VL IS SI MA BP EP AR DI D2 SU PD PY AB X4 C1 Y4 Z4 AK CT CY SP CL TC Z8 ZB ZS Z9 SN BN WC UT PM
The other files I've tested also have the same columns. Since wosfile.record.Record.parse skips empty values, only this one record in this one file triggers the crash.
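For anyone who wants to hunt down such records themselves, here is a rough sketch that bypasses wosfile and reads a tab-delimited export with the standard csv module. The helper name records_with_field and the sample rows are made up for illustration, and I'm assuming the file has a header row of tags like the column list above.

```python
import csv
import io


def records_with_field(lines, tag):
    """Yield (row_number, record) for rows where column `tag` is non-empty."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_number, record in enumerate(reader, start=1):
        # Short rows give None for missing columns, hence the `or ""`.
        if (record.get(tag) or "").strip():
            yield row_number, record


# Made-up sample data: only the second record has "Y1" populated.
sample = io.StringIO(
    "PT\tAU\tY1\tC1\tUT\n"
    "J\tDoe, J\t\tUniv A\tWOS:0001\n"
    "J\tRoe, R\tUniv B\t\tWOS:0002\n"
)
for row_number, record in records_with_field(sample, "Y1"):
    print(row_number, record["UT"], record["Y1"])
# prints: 2 WOS:0002 Univ B
```

For a real file, pass an open file object instead of the StringIO; opening it with encoding="utf-8-sig" may be needed if the export starts with a byte order mark.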
Thanks for checking! I'll start working on this.
When parsing a WOS file that has content in a "Y1" field using wosfile.records_from, the parser crashes.
The full list of tags you mentioned in #14 lists that tag too.
According to that list, of all possible things, "Y1" is Portuguese document title. Boggles the mind why specifically Portuguese has its own field, but there you have it.
One of my data files happens to have that field populated for some entries. Since the file contains thousands of entries, the simplest and quickest solution is to fix the parser.
For now, I just added the tag to wosfile/tags.py.
A proper solution would be to either add support for all tags to tags.py, or to remove the need for that somehow.
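One way "removing the need" might look is a lookup with a conservative default, so that a tag missing from the table is treated as a plain string instead of raising. This is only a sketch under assumptions: KNOWN, DEFAULT, tag_info and parse_value are hypothetical names, the per-tag flags are illustrative, and the separators ("\n" and ";") are my guess at how multi-valued fields are laid out, not wosfile's actual logic.

```python
# Hypothetical per-tag metadata: (splittable, one item per line).
KNOWN = {
    "AU": (True, True),
    "C1": (True, True),
    "WC": (True, False),
    "TI": (False, False),
}
# Assumed fallback for tags not in the table: do not split at all.
DEFAULT = (False, False)


def tag_info(tag):
    """Look up a tag's flags, falling back to DEFAULT for unknown tags."""
    return KNOWN.get(tag, DEFAULT)


def parse_value(tag, raw):
    """Parse a raw field value according to the tag's flags."""
    splittable, item_per_line = tag_info(tag)
    if not splittable:
        # Collapse continuation lines and extra whitespace into one string.
        return " ".join(raw.split())
    separator = "\n" if item_per_line else ";"
    return [item.strip() for item in raw.split(separator) if item.strip()]
```

The trade-off is that an unknown multi-valued field comes back as one unsplit string, which seems less harmful than a crash, and it can be combined with a warning so new tags still get noticed.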