sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Unicode right-to-left character encoding issue #344

Open jacobthill opened 1 year ago

jacobthill commented 1 year ago

Some Arabic strings have an unencoded character in them: e.g.

\u200f (U+200F, RIGHT-TO-LEFT MARK) is a non-printing right-to-left Unicode character. It is in the original data but is not visible when rendered. We didn't have an issue with this before, but with the new airflow process this character isn't getting encoded. The issue is likely in intake. We need this character either to render properly or to be removed. I'm not sure of the implications of removing it, but here is a way to do that:

https://stackoverflow.com/questions/46897952/remove-right-to-left-character-u200f-in-python-hebrew
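
For reference, the removal approach could look roughly like the sketch below. It is not code from this repository: the helper names `strip_rlm` and `find_format_chars` and the sample string are hypothetical, and it only assumes that the affected values are ordinary Python `str` objects. It removes U+200F alone, since dropping every invisible format character could also drop ones that matter in Arabic-script text (for example the zero-width non-joiner).

```python
import unicodedata

# U+200F RIGHT-TO-LEFT MARK, the character reported in this issue.
RLM = "\u200f"


def strip_rlm(text: str) -> str:
    """Return text with every U+200F removed; all other characters are kept."""
    return text.replace(RLM, "")


def find_format_chars(text: str) -> list:
    """Return the Unicode names of invisible format (category Cf) characters in text."""
    return [
        unicodedata.name(ch, "UNKNOWN")
        for ch in text
        if unicodedata.category(ch) == "Cf"
    ]


if __name__ == "__main__":
    # Hypothetical sample value: an Arabic word preceded by a right-to-left mark.
    sample = "\u200fكتاب"
    print(find_format_chars(sample))   # names of the hidden characters found
    print(repr(strip_rlm(sample)))     # the same string without U+200F
```

Running `find_format_chars` over a harvested record first would show whether U+200F is the only hidden character present before anything is stripped.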

jacobthill commented 1 month ago

I suspect this might be solved by retransforming the collection (now from JSON instead of CSV) and refreshing the data.