huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

CSV import of multiple Primary Documents is resulting in error #4633

Open llfinch opened 2 years ago

llfinch commented 2 years ago

It appears that something is going wrong when trying to import entities in bulk that have more than one Primary Document PDF. To test this, I prepared a simple CSV like so, separating the file names with a pipe symbol and no spaces:

[Image: screenshot of the test CSV]

and zipped the CSV up with the two mentioned PDFs. But upon import, Uwazi threw an error saying that the file Example1.pdf|Sample2.pdf could not be found.

For background, I wanted to test this because in our current documentation, we don't explicitly say that it's possible to upload more than one Primary Document at a time. (We do explicitly say it's possible for Supporting Files, though.) For what it's worth, Jaume has already assessed that "It is trying to interpret the string as if that was the actual name of the file, including the pipe character."

txau commented 2 years ago

@llfinch would this solve your particular use case? https://github.com/huridocs/uwazi/pull/2340

This allows you to add one document per language. If you want to add more than one document to the same language it won't work.

RafaPolit commented 2 years ago

There is a way to upload multiple primary files; it is described here: https://github.com/huridocs/uwazi/pull/2340

The caveat is that you can only import one file per language. The idea of "primary files" is to hold translations of a single file, not really "multiple files", although it CAN be used for multiple files. If you need multiple files in the same language, then yes, we need to develop that as a feature.

llfinch commented 2 years ago

Thanks, Rafa and Jaume! Just to check my understanding: when you say it can only import one file per language, do you mean the UI language, or the language label that is automatically assigned to the document? I've been trying to test this out in my own instance, but for the life of me I can't get it to work as laid out in Joan's comment in https://github.com/huridocs/uwazi/pull/2340. I can import one primary document just fine, but the minute I try to do two on one entity (one doc in Spanish and one in English), it either imports only the metadata without documents or throws an error. I've tried several different scenarios: tweaking capitalization, regular CSV vs. CSV UTF-8, adding a Spanish interface to my instance, using just "title" and using "title_en" and "title_es", and naming the files with and without an underscore language code. Not sure what I'm doing wrong...

In any case, if it requires a matching UI language, then this wouldn't help my use case: at the time of import there will only be an English-language interface, but there will be potentially 100 different languages across all the documents, and we can't have, nor do we want, 100 different language interfaces. That would mean I can upload half of the documents at import and then have to add several hundred documents manually, entity by entity, after the fact.

If it simply refers to the language of the document as automatically assigned, I think this would be workable for my use case at present, since I believe the majority of pairings are in different languages (mainly a country's national language and an English translation).

However, it's not hard to foresee a scenario where it wouldn't be enough. Think of documents that are mostly legislation, policy, official views, etc. Since these are periodically updated, the ideal would be to keep the older version on the same entity alongside the most up-to-date version. A user could then see the overarching "story" of the entity in the metadata (original adoption date, date of updating, details about the process to pass it originally, details about why it was updated, etc.) and explore the two different texts without having to get into the weeds of relationships and without having to download anything to their device (because they can use the built-in document reader).

txau commented 2 years ago

@llfinch got it. We need to improve the feature so that multiple files can be added by separating the file names with the | character.
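
For illustration only, here is a minimal sketch of the pipe-splitting behavior discussed in this thread. It is not Uwazi's actual import code: the helper names `split_file_names` and `missing_files` are hypothetical, and the sketch simply assumes that the import step knows the set of file names contained in the uploaded ZIP.

```python
from typing import List, Set


def split_file_names(cell: str, separator: str = "|") -> List[str]:
    """Split one CSV cell into individual file names, dropping empty parts.

    Hypothetical helper: treats the cell as a separator-delimited list
    instead of as a single literal file name.
    """
    return [name.strip() for name in cell.split(separator) if name.strip()]


def missing_files(cell: str, available: Set[str]) -> List[str]:
    """Return the file names from the cell that are not in the archive."""
    return [name for name in split_file_names(cell) if name not in available]


# File names taken from the report above; `archive` stands in for the
# contents of the uploaded ZIP (an assumption made for this sketch).
archive = {"Example1.pdf", "Sample2.pdf"}
print(split_file_names("Example1.pdf|Sample2.pdf"))
print(missing_files("Example1.pdf|Sample2.pdf", archive))
```

With this approach, each name is looked up separately, so the cell from the report would resolve to two files rather than one nonexistent file whose name contains the pipe character. A cell with a single file name and no separator still yields a one-item list, so existing single-document imports would behave as before.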