rafguns / wosfile

Handle Clarivate Analytics Web of Science™ export files

Tag Y1 crashes the parser #19

Open Technologicat opened 8 months ago

Technologicat commented 8 months ago

When parsing a WOS file that has content in a "Y1" field with wosfile.records_from, the parser crashes:

Traceback (most recent call last):
  File "/home/.../myscript.py", line 185, in <module>
    main()
  File "/home/.../myscript.py", line 50, in main
    for rec in wosfile.records_from(opts.filenames):
  File "/home/xxx/.local/lib/python3.10/site-packages/wosfile/record.py", line 118, in records_from
    yield Record(wos_record, skip_empty)
  File "/home/xxx/.local/lib/python3.10/site-packages/wosfile/record.py", line 27, in __init__
    self.parse(wos_data)
  File "/home/xxx/.local/lib/python3.10/site-packages/wosfile/record.py", line 39, in parse
    if is_splittable[field_name]:
KeyError: 'Y1'

The full list of tags you mentioned in #14 lists that tag too.

According to that list, of all possible things, "Y1" is Portuguese document title. Boggles the mind why specifically Portuguese has its own field, but there you have it.

One of my data files happens to have that field populated for some entries. Since the file contains thousands of entries, the simplest and quickest solution is to fix the parser.

For now, I just added this to wosfile/tags.py:

    ("Y1", "Portuguese document title", False, False),

A proper solution would be to either add support for all tags to tags.py, or to remove the need for that somehow.
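A minimal sketch of the second option (removing the need for a complete tag table), assuming the lookup can fall back to a default for tags it does not know. The `TAGS` table and the `splittable_or_default` helper are hypothetical stand-ins, not wosfile's actual code. Only the `is_splittable[field_name]` lookup and the four-element tuple shape come from the traceback and snippet above; the meaning given to the two booleans is an assumption.

```python
# Hypothetical stand-in for the tag table in wosfile/tags.py.
# Assumed tuple layout: (abbreviation, name, splittable, one item per line).
TAGS = (
    ("AU", "Authors", False, True),
    ("TI", "Document title", False, False),
    ("WC", "Web of Science categories", True, False),
)

# Lookup table built from the tuples, mirroring the name seen in the traceback.
is_splittable = {abbr: splittable for abbr, _name, splittable, _per_line in TAGS}


def splittable_or_default(field_name, default=False):
    """Return the splittable flag, treating unknown tags as plain strings."""
    # dict.get avoids the KeyError that a bare is_splittable[field_name] raises.
    return is_splittable.get(field_name, default)
```

With this kind of fallback, an unlisted tag such as "Y1" would be kept as an unsplit string instead of aborting the whole parse.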

rafguns commented 8 months ago

Thanks for the issue, I haven't encountered that field before. Does it come from some specific Web of Science index? I am most familiar with the 'core' citation indexes, which don't use this field, as far as I'm aware.

Anyway, my current thinking is that we have to do two things:

Technologicat commented 8 months ago

I don't know where exactly it came from. It was in a bunch of datafiles a coworker sent me. We'd like to visualize a bunch of papers via their semantic embeddings (as well as possibly perform some other NLP analysis on titles and abstracts), so I'm building a custom tool to do that. The wosfile library handles parsing the input files, which are in WOS format.

Exactly one of the files I got has at least one record that has Y1 populated. I'll ask for the details. Come to think of it, I could also print out the offending records to see if there's a pattern.

A warning for unknown fields sounds nice. For example, I don't exactly need the Portuguese document title, so I'd be fine parsing just the known fields correctly.
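Roughly what I have in mind, as a sketch only: `KNOWN_TAGS` and `keep_known_fields` are made-up names for illustration, and the real check would presumably live wherever the parser looks up a tag. It uses the standard library `warnings` module so callers can silence or escalate the message.

```python
import warnings

# Hypothetical subset of the tags the parser knows about.
KNOWN_TAGS = {"AU", "TI", "AB", "C1", "UT"}


def keep_known_fields(raw_record):
    """Return only the known fields of a raw record, warning about the rest."""
    kept = {}
    for field_name, value in raw_record.items():
        if field_name not in KNOWN_TAGS:
            # Warn and skip instead of raising KeyError.
            warnings.warn(f"Skipping unknown field {field_name!r}", stacklevel=2)
            continue
        kept[field_name] = value
    return kept
```

A caller who wants the old strict behavior could turn the warning into an exception with `warnings.simplefilter("error")`.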

Technologicat commented 8 months ago

Ok, investigated. This is a data entry error. In exactly one record of exactly one of my datafiles, the field "Y1" is populated with author affiliations, which should be in the "C1" field.

FWIW, the full list of columns in that file is:

PT AU BA CA GP RI OI BE Z2 TI X1 Y1 Z1 FT PN AE Z3 SO S1 SE BS VL IS SI MA BP EP AR DI D2 SU PD PY AB X4 C1 Y4 Z4 AK CT CY SP CL TC Z8 ZB ZS Z9 SN BN WC UT PM

The other files I've tested also have the same columns. Since wosfile.record.Record.parse skips empty values, only this one record in this one file triggers the crash.
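For anyone who wants to run the same check on their own data, here is a sketch of how the offending records can be located without going through the parser. It assumes a tab-delimited export with a header row like the column list above; `rows_with_field` is a hypothetical helper name, and "UT" is used as the record identifier.

```python
import csv


def rows_with_field(lines, field_name):
    """Yield (row number, UT, value) for rows where field_name is non-empty.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    # QUOTE_NONE keeps stray quote characters in titles from confusing csv.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        # Short rows give None for missing columns, hence the "or".
        value = (row.get(field_name) or "").strip()
        if value:
            yield row_number, row.get("UT", ""), value
```

Usage would be along the lines of `with open(path, encoding="utf-8-sig", newline="") as f: print(list(rows_with_field(f, "Y1")))`, where the encoding is an assumption about how the file was exported.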

rafguns commented 8 months ago

Thanks for checking! I'll start working on this.