omeka / plugin-CsvImport

Allows users to import items from a simple CSV (comma separated values) file, and then map the CSV column data to multiple elements, files, and/or tags. Each row in the file represents metadata for a single item. This plugin is useful for exporting data from one database and importing that data into an Omeka site.

Can't import utf8 file with unicode text #7

Open kintopp opened 12 years ago

kintopp commented 12 years ago

Omeka 1.5.1 and CsvImport v.1.3.3. MySQL database with utf8 collation. Fresh Omeka install.

If I import the bundled tests/test.csv file in the plugin, all works correctly. Modifying this sample data to include umlauts also works correctly. Modifying the data to include Greek or Japanese text results in the test file not being imported, i.e. I'm returned to the import dialogue without having an opportunity to match fields. When the Japanese or Greek text is replaced with Roman text again, the file is properly imported by the plugin once more. The test CSV file opens as UTF-8 in BBEdit and was saved again as such.
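For anyone who wants to confirm the encoding independently of an editor, here is a minimal sketch in Python. The function name `describe_encoding` is made up for illustration and is not part of Omeka or CsvImport; it only checks whether the bytes decode as UTF-8 and whether a byte order mark is present, and says nothing about how PHP will read the file.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Report whether raw file bytes look like UTF-8 (hypothetical helper)."""
    # A UTF-16 or UTF-32 byte order mark means the file is not UTF-8 at all.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16/32 (BOM found)"
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that failed to decode.
        return f"not valid utf-8 (byte offset {exc.start})"
    return "utf-8 with BOM" if has_bom else "utf-8"


if __name__ == "__main__":
    sample = "title,creator\n日本語,ελληνικά\n".encode("utf-8")
    print(describe_encoding(sample))
```

To check a real file, pass its raw bytes, for example `describe_encoding(open("tests/test.csv", "rb").read())`.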

zerocrates commented 12 years ago

I have a sneaking suspicion this might be related to PHP's locale setting. Having the locale on your server set to something other than UTF-8 may be what's causing this, since the CSV-reading functionality we use is locale-sensitive (sometimes).

If this is the case, it's slightly tricky to fix on our end, since we can try to set a locale, but you need to give a language/region in addition to an encoding. We could make the assumption of en_US, but that's not going to work everywhere.

kintopp commented 12 years ago

Can you show me where to look for this? I'm not a developer... I'm using OSX (and testing under XAMPP) and locale returns:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Looking in the XAMPP configuration overview, I see this under PHP Variables which might be relevant also:

_SERVER["HTTP_ACCEPT_LANGUAGE"] en,de;q=0.5

zerocrates commented 12 years ago

Those look like "right" locales (though they may not end up being the ones used by PHP). As a test, calling echo setlocale(LC_ALL, '0'); should tell you what PHP thinks its current locale setting is.

zerocrates commented 11 years ago

People seem to be reporting better luck using Firefox when uploading their CSV files. I'm not really sure how that could be affecting this, but several people have reported success with Firefox after failure from other browsers.

willynilly commented 11 years ago

I was not able to reproduce this bug on Chrome with the latest master. I tested Japanese, Chinese, Greek, and Vietnamese in a UTF-8 file using my Mac.

symac commented 9 years ago

@kintopp I know this is an old issue, but I had the same problem today. After several attempts, I think the issue is with _validateSource in application/libraries/Omeka/File/Ingest/Url.php. The URLs I have for files, which contain diacritics, do not validate via Zend_Uri::factory.

I was able to load the file by changing the URL from:
http://geobib.fr/tmp/CPA/012-Vue_générale_prise_de_la_Petite-Perrière.jpg
to:
http://geobib.fr/tmp/CPA/012-Vue_g%C3%A9n%C3%A9rale_prise_de_la_Petite-Perri%C3%A8re.jpg

This now works for me, so I think it is worth leaving this comment in case somebody encounters the same issue.

@zerocrates @willynilly pinging you in case this makes sense to you and you can think of a fix. I am leaving this file on my server for a few weeks, so don't hesitate to use it if you want to try to replicate the problem.
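The manual change described above (percent-encoding the accented characters in the file URL) can be done in bulk before building the CSV. Below is a minimal sketch in Python rather than PHP; `encode_url_path` is a made-up helper name, and it assumes the URL has not been percent-encoded yet, since any existing `%` would be encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode the path of a URL as UTF-8 (hypothetical helper).

    Assumes the path is NOT already encoded: a literal '%' becomes '%25'.
    """
    parts = urlsplit(url)
    # Keep '/' so the directory structure of the path is preserved.
    encoded_path = quote(parts.path, safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    raw = "http://geobib.fr/tmp/CPA/012-Vue_générale_prise_de_la_Petite-Perrière.jpg"
    print(encode_url_path(raw))
```

Only the path is touched here; the scheme, host, query string, and fragment are passed through unchanged.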

zerocrates commented 9 years ago

I think this issue of UTF-8 in URLs is different than the usual problem that this issue represents, which is more about the encoding of the whole file and having the import not even progress to the mapping screen.

I'd have to think a little about whether we can handle this automatically... I wouldn't want to just always urlencode the URL before ingesting it, because that could mess up URLs that are already encoded. At a minimum we can document that URL encoding should be used. Maybe there's also some way to make Zend's validator accept these URLs.
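One way to sidestep the double-encoding concern when preparing a CSV is to encode only the non-ASCII characters and leave everything else, including existing `%XX` escapes, alone. The following is a rough sketch in Python, not something Omeka or Zend does; `encode_non_ascii` is a made-up name, and the approach deliberately ignores other characters that may also need encoding, such as spaces.

```python
import re
from urllib.parse import quote

# Matches runs of characters outside the ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def encode_non_ascii(url: str) -> str:
    """Percent-encode only non-ASCII runs as UTF-8 (hypothetical helper).

    ASCII characters, including any existing %XX escapes, are left as they
    are, so running this on an already-encoded URL changes nothing.
    """
    return _NON_ASCII.sub(lambda match: quote(match.group(0)), url)


if __name__ == "__main__":
    raw = "http://geobib.fr/tmp/CPA/012-Vue_générale_prise_de_la_Petite-Perrière.jpg"
    once = encode_non_ascii(raw)
    twice = encode_non_ascii(once)
    print(once)
    print(once == twice)
```

Because the output is pure ASCII, applying the function a second time is a no-op, which is the property a blanket urlencode lacks.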