Tag Y1 crashes the parser

Technologicat commented 8 months ago

When parsing a WOS file that has content in a "Y1" field using wosfile.records_from, the parser crashes:

Traceback (most recent call last):
  File "/home/.../myscript.py", line 185, in <module>
    main()
  File "/home/.../myscript.py", line 50, in main
    for rec in wosfile.records_from(opts.filenames):
  File "/home/xxx/.local/lib/python3.10/site-packages/wosfile/record.py", line 118, in records_from
    yield Record(wos_record, skip_empty)
  File "/home/xxx/.local/lib/python3.10/site-packages/wosfile/record.py", line 27, in __init__
    self.parse(wos_data)
  File "/home/xxx/.local/lib/python3.10/site-packages/wosfile/record.py", line 39, in parse
    if is_splittable[field_name]:
KeyError: 'Y1'

The full list of tags you mentioned in #14 lists that tag too.

According to that list, "Y1" is, of all things, the Portuguese document title. It boggles the mind why Portuguese specifically has its own field, but there you have it.

One of my data files happens to have that field populated for some entries. Since the file contains thousands of entries, the simplest and quickest solution is to fix the parser.

For now, I just added this to wosfile/tags.py:

    ("Y1", "Portuguese document title", False, False),

A proper solution would be to either add support for all tags to tags.py, or to remove the need for that somehow.

rafguns commented 8 months ago

Thanks for the issue, I haven't encountered that field before. Does it come from some specific Web of Science index? I am most familiar with the 'core' citation indexes, which don't use this field, as far as I'm aware.

Anyway, my current thinking is that we have to do two things:

[x] Go over that list again, and add any unknown tags to tags.py. I hadn't done that yet because it is hard to know if a field is 'splittable' and, if so, if it puts each item on a separate line or not (WoS is sadly inconsistent). But the current situation is even less ideal, so I'll just make an educated guess. (This in itself will already help a lot but since WoS continues to add tags, I think we need to do more to avoid the issue entirely, hence the rest of this list.)
[ ] Avoid raising an exception if a field is unknown, as detailed here.
[ ] If a field is unknown, we could issue a warning. Your program will continue to run, but the warning gives a heads-up that a specific field might not be parsed optimally.
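
The last two items could be sketched roughly as follows. This is only an illustration, not wosfile's actual code: the table `IS_SPLITTABLE` and the helper `is_field_splittable` are hypothetical stand-ins for the lookup that raises the KeyError in the traceback above, and defaulting unknown fields to "not splittable" is an assumption.

```python
import warnings

# Hypothetical stand-in for the lookup table derived from tags.py.
IS_SPLITTABLE = {"AU": True, "TI": False, "C1": True}


def is_field_splittable(field_name, table=IS_SPLITTABLE):
    """Look up whether a field is splittable without failing on unknown tags."""
    try:
        return table[field_name]
    except KeyError:
        # Assumption: an unknown field is kept as a single unsplit string.
        warnings.warn(
            f"Unknown WoS field {field_name!r}; treating it as not splittable",
            stacklevel=2,
        )
        return False
```

With this approach a record containing "Y1" would still be parsed, and the caller could silence or escalate the warning through the standard `warnings` filters.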

Technologicat commented 8 months ago

I don't know where exactly it came from. It was in a bunch of datafiles a coworker sent me. We'd like to visualize a bunch of papers via their semantic embeddings (as well as possibly perform some other NLP analysis on titles and abstracts), so I'm building a custom tool to do that. The wosfile library is, obviously, parsing those input files that are in WOS format.

Exactly one of the files I got has at least one record that has Y1 populated. I'll ask for the details. Come to think of it, I could also print out the offending records to see if there's a pattern.

A warning for unknown fields sounds nice. For example, I don't exactly need the Portuguese document title, so I'd be fine parsing just the known fields correctly.

Technologicat commented 8 months ago

Ok, investigated. This is a data entry error. In exactly one record of exactly one of my datafiles, the field "Y1" is populated with author affiliations, which should be in the "C1" field.

FWIW, the full list of columns in that file is:

PT AU BA CA GP RI OI BE Z2 TI X1 Y1 Z1 FT PN AE Z3 SO S1 SE BS VL IS SI MA BP EP AR DI D2 SU PD PY AB X4 C1 Y4 Z4 AK CT CY SP CL TC Z8 ZB ZS Z9 SN BN WC UT PM

The other files I've tested also have the same columns. Since wosfile.record.Record.parse skips empty values, only this one record in this one file triggers the crash.
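
For anyone who wants to run the same check on their own data, here is a minimal sketch that needs only the standard library. It assumes the file is a tab-delimited export whose first row is the column list shown above; the helper name `rows_with_field` and the filename in the usage comment are hypothetical.

```python
import csv


def rows_with_field(lines, field="Y1"):
    """Yield (row_number, row) for each record whose `field` column is non-empty."""
    # QUOTE_NONE: treat quote characters in titles and abstracts as plain text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        if (row.get(field) or "").strip():
            yield row_number, row


# Usage (hypothetical filename; "utf-8-sig" strips a leading BOM if present):
# with open("savedrecs.txt", encoding="utf-8-sig", newline="") as fh:
#     for n, row in rows_with_field(fh):
#         print(n, row.get("UT"), row["Y1"])
```

Printing the "UT" identifier next to the "Y1" value makes it easy to look up the offending record in the source database.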

rafguns commented 8 months ago

Thanks for checking! I'll start working on this.

rafguns / wosfile

Tag Y1 crashes the parser #19