parisdata / GLAMhack2020

Provenance Text Analysis Project, launched June 5 at GLAMhack2020

Improve quality of provenance source data #2

Open tfmorris opened 3 years ago

tfmorris commented 3 years ago

I'm not sure where the web scraping code for the source data is located, so I'm flagging the issues here. Please feel free to move this to the correct repo.

There are a couple of issues which make the NLP tasks more difficult than they need to be (and they're already hard):

  • Multiple lines are merged together with no newline character. These often represent multiple owners in the provenance, making the delimiters doubly important. For example, these https://www.moma.org/collection/works/34908 three https://www.moma.org/collection/works/78420 items https://www.moma.org/collection/works/78788 cause the NER to produce an entity called "1958The Museum of Modern Art" when the 1958 belongs with the previous owner in the provenance.
  • The work name, artist name, and creation date are missing, all of which would be valuable context for parsing the provenance. For example, https://www.moma.org/collection/works/78420 gives "Léonide Massine (1896-1979), Paris / New York. Acquired from Léger, by 1935 - 1958" for a work that was painted by Fernand Léger in 1914.
  • Notes are conflated with the actual provenance (this may be a limitation of the source websites).

parisdata commented 3 years ago

Thanks. Will move to comments/discussion.

I would reframe the issue as: 1) improve the robustness of the code so that it works for messy real-world texts from all sources; 2) provide tips to users for best results (any user can load provenance texts for analysis).

Concerning notes, trial runs showed that key names concerning provenance and the reliability of provenance statements often appear only in notes.

Laurel
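As an aside for anyone wrestling with the merged-record problem described above (entities like "1958The Museum of Modern Art"), a crude repair is possible even after the newlines have been lost: insert a boundary wherever a four-digit year runs directly into a capitalized word. This is only a heuristic sketch, not part of the project's pipeline, and the function name is hypothetical:

```python
import re

def split_merged_provenance(text):
    """Heuristically restore record boundaries where a trailing year runs
    directly into the next owner's name, e.g.
    '1958The Museum of Modern Art' -> '1958\\nThe Museum of Modern Art'."""
    # A four-digit year immediately followed by an uppercase letter is
    # very likely a lost line break between two provenance records.
    return re.sub(r"(\b\d{4})(?=[A-Z])", r"\1\n", text)
```

Years inside parentheses or followed by punctuation, as in "(1896-1979)", are left untouched, so normal life dates survive the repair.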

tfmorris commented 3 years ago

Of course having the code work on whatever inputs are available is better, but my point was that information which is available, such as the boundary between different provenance items, shouldn't be thrown away. Similarly for notes: keep them, but keep them separate, since the web scraping program knew that they were separate to start with.

Speaking of provenance, where does the software for the rest of the processing pipeline (web scraping, provenance downloading, etc.) live?

parisdata commented 3 years ago

Thanks for the contributions!

The gathering of provenance texts is not part of the data pipeline; they are user-provided input. The texts used in the GLAMhack were gathered from public websites using several different tools. Each site is different and no one tool works for all. Texts were copied as is, with very few exceptions.

Any researcher or museum capable of creating a CSV file is invited to download their own provenance data or use the provenance data already gathered.
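For anyone preparing such a file, here is a minimal sketch of what a provenance CSV could look like and how it would be read. The column names (`museum`, `object_url`, `provenance_text`) are hypothetical illustrations, not the project's actual schema:

```python
import csv
import io

# A hypothetical two-line provenance CSV; real files would come from a
# museum website or the researcher's own collection data.
sample = io.StringIO(
    "museum,object_url,provenance_text\n"
    'MoMA,https://www.moma.org/collection/works/78420,'
    '"Léonide Massine (1896-1979), Paris / New York. '
    'Acquired from Léger, by 1935 - 1958"\n'
)

# DictReader handles the quoted field, so commas inside the provenance
# text do not break the row into extra columns.
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["museum"], "->", row["provenance_text"])
```

Keeping one provenance statement per row, with the source URL alongside it, preserves exactly the record boundaries discussed earlier in this thread.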

For the GLAMhack2020 data, the central file was gathered from the official AAM Nazi-Era Provenance Internet Portal (NEPIP), to which 179 American museums contributed.

http://www.nepip.org/

NEPIP lists artworks that museums identified as changing hands during the Nazi era, in many cases with provenance gaps.

The Glamhack file focuses on provenance for these artworks.
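A common convention in American museum provenance writing (the one described in the AAM's provenance research guidance) is that a semicolon between owners indicates a direct transfer, while a period indicates that the transfer is uncertain, i.e. a possible gap. Assuming a text follows that convention, a rough gap detector can be sketched; this is an illustration only, not the project's actual method, and periods inside dates or abbreviations would cause false splits:

```python
import re

def provenance_events(text):
    """Split a provenance statement into ownership events, marking each
    boundary as a semicolon (direct transfer) or a period (possible gap),
    per the common museum punctuation convention. Heuristic only."""
    # Splitting on the captured delimiter keeps it in the result list,
    # alternating segment, delimiter, segment, delimiter, ...
    parts = re.split(r"\s*([.;])\s*", text.strip())
    events = []
    for i in range(0, len(parts), 2):
        segment = parts[i]
        if not segment:
            continue  # e.g. an empty tail after a trailing period
        delim = parts[i + 1] if i + 1 < len(parts) else None
        events.append({"owner": segment, "possible_gap": delim == "."})
    return events
```

Flagging the period-delimited boundaries would let a researcher triage exactly the gap cases NEPIP was built to surface.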

(For those interested, I'll be speaking about analysing NEPIP from a slightly different angle on November 3: "Computing Impact: A methodology for identifying top NEPIP art dealers by frequency of mentions in provenance" http://blog.apahau.org/colloque-en-ligne-beyond-borders-the-key-for-art-market-power-2-et-3-novembre-2020/ )

Concerning the inclusion of artists and creation dates: this is a good idea and was omitted not by design but because it demanded extra work to harmonize so many files. Will try to include in future...

Laurel
