Encoding problem with flixbus data.

ue71603 commented 2 months ago

Using: https://transport.data.gouv.fr/resources/11681?id=11681&locale=en

I added a reasonable feed_info.txt.

However, running GtfsNeTEx.py fails with a character encoding error. This needs to be more robust:

C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\venv\Scripts\python.exe C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\GtfsNeTEx.py 
C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\venv\Lib\site-packages\xsdata\formats\converter.py:215: ConverterWarning: No converter registered for `<class 'numpy.int64'>`
  warnings.warn(f"No converter registered for `{data_type}`", ConverterWarning)
Traceback (most recent call last):
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\GtfsNeTEx.py", line 897, in <module>
    gtfs = GtfsNeTexProfile(conn=duckdb.connect(database='gtfs2.duckdb', read_only=True), serializer=serializer, full=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\GtfsNeTEx.py", line 879, in __init__
    self.incremental()
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\GtfsNeTEx.py", line 863, in incremental
    self.serializer.write(out, self.getPublicationDelivery(operators, [line], stop_areas, scheduled_stop_points, service_journeys, availability_conditions), self.ns_map)
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\venv\Lib\site-packages\xsdata\formats\dataclass\serializers\xml.py", line 76, in write
    handler.write(events)
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\venv\Lib\site-packages\xsdata\formats\dataclass\serializers\writers\lxml.py", line 64, in write
    self.output.write(xml)
  File "C:\Users\ue71603\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u010c' in position 4695: character maps to <undefined>

Process finished with exit code 1
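
The last frames of the traceback show the failure happening on output: `self.output.write(xml)` goes through the cp1252 codec, which has no mapping for U+010C (Č). Below is a minimal sketch of that behaviour, assuming the output stream is a text stream using the Windows default encoding. The helper name `encode_text` and the sample string are hypothetical and not taken from GtfsNeTEx.py.

```python
import io

# Hypothetical sample containing U+010C, the character named in the traceback.
SAMPLE_XML = "<Name>\u010cesk\u00e9 Bud\u011bjovice</Name>"


def encode_text(text: str, encoding: str) -> bytes:
    """Write text through a text-mode wrapper, as a file opened with `encoding` would."""
    buffer = io.BytesIO()
    with io.TextIOWrapper(buffer, encoding=encoding, newline="") as out:
        out.write(text)   # raises UnicodeEncodeError if the codec lacks a character
        out.flush()
        return buffer.getvalue()


try:
    encode_text(SAMPLE_XML, "cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 fails:", exc.reason)

# An explicit UTF-8 encoding can represent every Unicode character.
utf8_bytes = encode_text(SAMPLE_XML, "utf-8")
```

If this is indeed the cause, passing `encoding='utf-8'` to whatever `open(...)` call creates the output file would likely avoid the error regardless of the platform default.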

skinkie commented 2 months ago

You can't make this robust if the source data does not declare its encoding, or mixes Latin-1 with UTF-8 or UTF-16. So let's first figure out which codec the original file is written in.
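
A rough way to check the codec with only the standard library is sketched below. It looks for a byte order mark, then tries a strict UTF-8 decode, and otherwise falls back to Latin-1. The function name `guess_encoding` is hypothetical, and the Latin-1 fallback is only a guess, since Latin-1 decodes any byte sequence without error. UTF-32 is not handled.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a plausible codec name for raw file bytes (heuristic, not proof)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # UTF-8 with a byte order mark
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # the utf-16 codec consumes the BOM itself
    try:
        data.decode("utf-8")  # strict: raises on any invalid byte sequence
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: treat anything else as Latin-1; this cannot be verified.
        return "latin-1"
```

Usage would be along the lines of `guess_encoding(open('stops.txt', 'rb').read())` for each file in the feed.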

ue71603 commented 2 months ago

https://dev.to/bowmanjd/character-encodings-and-detection-with-python-chardet-and-cchardet-4hj7

ue71603 commented 2 months ago

@skinkie is import.py generated? Or can I make changes there?

ue71603 commented 2 months ago

Changes to import.py:

- allow for shapes.txt not existing
- write a simple feed_info.txt when it does not exist, and import it
- add an explicit encoding to open: with open('some.csv', newline='', encoding='utf-8') as f:
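
The three changes could look roughly like the sketch below. It is not the actual import.py: the helper names `read_table` and `ensure_feed_info` are hypothetical, and it uses plain `csv` instead of whatever loader import.py really uses. The header row uses the three feed_info.txt fields that GTFS requires.

```python
import csv
from pathlib import Path


def read_table(feed_dir, name, required=True):
    """Read one GTFS file as a list of dicts; optional files may be absent."""
    path = Path(feed_dir) / name
    if not path.exists():
        if required:
            raise FileNotFoundError(path)
        return []  # e.g. shapes.txt is optional
    # utf-8-sig reads plain UTF-8 and also strips a leading BOM if present.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def ensure_feed_info(feed_dir, publisher_name, publisher_url, lang):
    """Write a minimal feed_info.txt if the feed does not ship one."""
    path = Path(feed_dir) / "feed_info.txt"
    if not path.exists():
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["feed_publisher_name", "feed_publisher_url", "feed_lang"])
            writer.writerow([publisher_name, publisher_url, lang])
    return path
```

For example, `read_table(feed_dir, 'shapes.txt', required=False)` returns an empty list when the file is missing instead of raising.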

ue71603 commented 2 months ago

GTFS is not mandatory UTF-8. But it should be, and in Europe I think it must be: https://en.wikipedia.org/wiki/GTFS#:~:text=A%20GTFS%20feed%20is%20a,character%20encoding%20is%20UTF%2D8.

skinkie commented 2 months ago

> @skinkie is import.py generated? Or can I make changes there?

Manually created. But as I mentioned before, we should still have something that adds optional columns, or even entire tables, iff missing.
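
As an illustration of the "add optional columns iff missing" idea, here is a small sketch over rows held as dicts. The function name `fill_optional_columns` is hypothetical, and the real importer may do this at the database level instead.

```python
def fill_optional_columns(rows, optional_columns):
    """Add each optional column with an empty value, only where it is missing."""
    defaults = {column: "" for column in optional_columns}
    # Existing values win because the row is merged after the defaults.
    return [{**defaults, **row} for row in rows]
```

A row such as `{'stop_id': '1'}` gains an empty `stop_desc`, while a row that already has `stop_desc` keeps its value.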

skinkie commented 2 months ago

> GTFS is not mandatory UTF-8. But it should be, and in Europe I think it must be: https://en.wikipedia.org/wiki/GTFS#:~:text=A%20GTFS%20feed%20is%20a,character%20encoding%20is%20UTF%2D8.

https://github.com/google/transit/issues/444

skinkie commented 2 months ago

Encoding is now resolved.

skinkie / reference

Encoding problem with flixbus data. #8