pitman closed this issue 12 years ago.
During the sprint we noticed something similar when importing a RIS file. It was also related to date fields and how fields were being treated as dates in the ElasticSearch backend. We need to consider providing some way of detecting 'failed' imports and giving feedback to the UI.
Yes, there is an issue with dates. If a field is previously identified as containing a date, it must contain dates from then on (or vice versa). Therefore, as we cannot know the quality of the datasets, and cannot know all the key names that may come in, we should probably alter the importer to convert everything to strings during ingest.
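A minimal sketch of what that importer change might look like; the stringify_values helper and the record structure below are hypothetical illustrations, not the actual BibServer code:

```python
def stringify_values(record):
    """Recursively convert every leaf value in a record to a string,
    so no key ever gets locked to a date (or numeric) type in the index."""
    if isinstance(record, dict):
        return {k: stringify_values(v) for k, v in record.items()}
    if isinstance(record, list):
        return [stringify_values(v) for v in record]
    return str(record)

# Example: mixed "date-like" and plain-year values all become plain strings.
records = [
    {"title": "Paper A", "published_date": "2011/10/02"},
    {"title": "Paper B", "published_date": 2011},
]
safe_records = [stringify_values(r) for r in records]
```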
This remains unsolvable in that we can only load one data type per key. However, the improvement of the parse functionality during the current backend-parse sprint should resolve this by reporting errors, so I am closing this ticket.
OK, but I hope at least that users can be warned that certain keys are expected to have certain types of values, and that upload errors will occur if they are not. I wasted many hours in December trying to work around this bug without knowing what the problem was. Now that you are aware of what the problem is, please can the scope of it be posted somewhere for all users to see? Also, by the way, where is the approved format for date fields specified?
Users cannot be warned beyond being told to do what is expected in BibJSON - it is so flexible precisely because it will map anything. BUT the first time a key is mapped, it is fixed to that type from then on, so whoever uses a key first fixes its type.

There is no specific format for date fields, but if you put a date in a value, its key will be mapped to the date type. In your example dataset you had dates as well as non-dates under the "published date" key, so it was impossible for them all to upload successfully. Something like "2011" is not a date, but something like "2011/10/02" is. The BibJSON examples use "year", "month", "day" rather than any sort of date - you can still put a date in a value if you want, as you did, but if you do so you have to be consistent about it or some of the records will fail. The reason for failure should become clearer when we have parse reports.
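As a rough illustration of the date/non-date distinction described above (the looks_like_date check below is only an assumed heuristic for "YYYY/MM/DD" values, not the backend's real date-detection rule):

```python
from datetime import datetime

def looks_like_date(value):
    """Heuristic: treat 'YYYY/MM/DD' strings as dates, anything else as plain text.
    Only an illustration of the distinction discussed above."""
    try:
        datetime.strptime(str(value), "%Y/%m/%d")
        return True
    except ValueError:
        return False

records = [
    {"published_date": "2011/10/02"},  # parses as a date -> key would get mapped as a date
    {"published_date": "2011"},        # not a full date -> would then fail to index under that mapping
]
for r in records:
    kind = "date-like" if looks_like_date(r["published_date"]) else "not a date"
    print(r["published_date"], "->", kind)
```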
The collection http://bibsoup.net/pitman/aldous2 contains only 129 records, whereas the source file http://bibserver.berkeley.edu/tmp/Aldous.json contains 223 records.
OK, it has been a pain to track down the cause of this, but I have made some progress. It appears bibsoup behaves badly if records have keys which contain spaces or capital letters.
Best results to date have been obtained by eliminating spaces and mapping all keys to lower case with kk = k.lower().replace(' ','_'). See http://bibsoup.net/pitman/aldous4, which is identical to aldous2 except for this mapping. This change in the source allows upload of 206 records instead of 129, but 17 records are still failing to upload for some reason. Note that ids in aldous4 have also been modified to eliminate spaces; this fixes the "No record found" bug described in another issue.
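A fuller sketch of that key mapping applied across a whole record; only the k.lower().replace(' ','_') rule comes from above, while the sanitize_record helper and example record are hypothetical:

```python
def sanitize_key(k):
    """Lower-case a key and replace spaces with underscores,
    the same mapping used to build aldous4 from aldous2."""
    return k.lower().replace(' ', '_')

def sanitize_record(record):
    """Apply sanitize_key to every key in a record, recursing into nested dicts and lists."""
    if isinstance(record, dict):
        return {sanitize_key(k): sanitize_record(v) for k, v in record.items()}
    if isinstance(record, list):
        return [sanitize_record(v) for v in record]
    return record

# Example: a key with spaces and capitals becomes a safe lower-case key.
rec = {"Publication Date": "2011/10/02", "Title": "Example"}
print(sanitize_record(rec))  # {'publication_date': '2011/10/02', 'title': 'Example'}
```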
Further progress. http://bibsoup.net/pitman/aldous5 now includes all records. This was achieved by a) sanitizing all the keys, as above, and b) omitting the "Publication date" field, which still causes grief even after the field name and field values are sanitized.
Here is a reduced dataset of one record which exemplifies the bug and fails to upload:
http://bibserver.berkeley.edu/tmp/aldous6.json
Why should there be any objection to "publication_date" as a field name?
Here are two more examples which fail to upload completely (only 4 of 20 records get through):
http://bibsoup.net/pitman/euclid_test/
and http://bibsoup.net/pitman/euclid1_test/
The 16 rejected records look like simple article records to me, while the ones accepted are proceedings items with HTML in the title. If anything, the rejected ones look simpler, so I can't see what is causing this problem.
I don't think the number field should be used like this. But we need to provide some explanation when records fail to upload for various reasons. Silent failure is not good.
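A minimal sketch of the kind of per-record feedback being asked for here; the upload_record simulation below only mimics the "first value fixes the type" behaviour using Python types, and is not the actual backend logic:

```python
# Simulated ingest: the first value seen for a key fixes its type,
# loosely mimicking the mapping behaviour described earlier in this thread.
key_types = {}

def upload_record(record):
    for k, v in record.items():
        t = type(v).__name__
        if key_types.setdefault(k, t) != t:
            raise ValueError(f"key '{k}' already mapped to {key_types[k]}, got {t}")

def upload_with_report(records):
    """Attempt every record and report failures, instead of failing silently."""
    failures = []
    for i, record in enumerate(records):
        try:
            upload_record(record)
        except ValueError as err:
            failures.append((i, str(err)))
    print(f"{len(records) - len(failures)} of {len(records)} records uploaded")
    for i, reason in failures:
        print(f"record {i} failed: {reason}")

upload_with_report([
    {"published_date": "2011/10/02"},
    {"published_date": 2011},   # conflicting type for the same key -> reported, not silent
])
```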