r-geoflow / geoflow

Tools to Orchestrate Geospatial (Meta)Data Management Workflows and Manage FAIR Services
https://github.com/r-geoflow/geoflow/wiki
Other
40 stars 14 forks

Unexpected bugs due to data encoding or other things like breakpoints in csv #378

Closed kikislater closed 5 months ago

kikislater commented 5 months ago

I generate a csv from code (Python, for example) and send it to another user. I don't know which editor the user modifies the csv in, but the modified file comes back with what RStudio shows as R breakpoints (red dots, see the screenshot below). They are only visible in RStudio: gedit, geany, vscodium and nano do not display them.

[screenshot]

File is available here: nosysakatia_zenodo-rawdata.csv

@juldebar: this is your csv ^^

When I run a geoflow workflow with this csv, an error is generated. Removing the breakpoints allows the workflow to run, of course. To avoid this kind of error, it would be good to have a check that removes these breakpoints.
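A minimal sketch of what such a check could look like, written in Python like the snippet in this issue. It rests on an assumption the issue does not confirm: that the red dots mark invisible characters (control characters, zero-width characters, non-breaking spaces) introduced by the other user's editor. The function names `find_suspects` and `clean_text` are hypothetical and are not part of geoflow.

```python
import unicodedata

# Control characters that legitimately appear in a csv file.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def is_suspect(ch):
    """True for invisible characters that are assumed to be unwanted."""
    if ch == " " or ch in ALLOWED_CONTROLS:
        return False
    # Cc: control characters, Cf: format characters (zero-width space, BOM),
    # Zs: space separators other than the plain space (e.g. no-break space).
    return unicodedata.category(ch) in {"Cc", "Cf", "Zs"}


def find_suspects(text):
    """Return (line, column, code point) for every suspect character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if is_suspect(ch):
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


def clean_text(text):
    """Replace exotic spaces with a plain space and drop other suspects."""
    out = []
    for ch in text:
        if not is_suspect(ch):
            out.append(ch)
        elif unicodedata.category(ch) == "Zs":
            out.append(" ")
        # Control and format characters are dropped silently.
    return "".join(out)


# Made-up sample: a no-break space, a zero-width space and a BEL character.
sample = "id,title\n1,Nosy\u00a0Sakatia\u200b\n2,ok\x07\n"
print(find_suspects(sample))
print(repr(clean_text(sample)))
```

Reporting the positions before cleaning makes it possible to tell the csv author where the problem is instead of silently rewriting the file. To run this on a real file, the text could be read with `open(path, encoding="utf-8", newline="")`, assuming the file is UTF-8.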

This also made me think of something else. I often receive csvs, or dbfs from shapefiles, from different types of users, and I then run into problems in different programming languages. A common workaround is to keep a dict that maps problematic strings to their replacements. Perhaps, along with this removal of "R breakpoints", it would be good to have a dictionary to handle bad encodings!

Example of such hardcoded replacements:

    # Spaces
    if "%20" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("%20", '-')
    # Parentheses
    if "%28" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("%28", '')
    if "%29" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("%29", '')
    # Commas
    if "%2C" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("%2C", '')
    # Apostrophes (URL-encoded as %27; %28 is already handled above)
    if "%27" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("%27", '')
    # Non-standard dash (en dash, U+2013, URL-encoded as UTF-8)
    if "%E2%80%93" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("%E2%80%93", '-')
    # Apostrophe as an HTML entity: replace the same string that is tested
    if "&#039;" in dkan_tag_name:
        dkan_tag_name = dkan_tag_name.replace("&#039;", '')
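The same rules can be sketched as an actual dictionary, which is roughly the kind of lookup table suggested here. This is only an illustration: `TAG_REPLACEMENTS` and `clean_tag` are hypothetical names, and the table assumes that the "Apostrophes" rule was meant to target `%27` and that the HTML entity is written `&#039;` with its trailing semicolon.

```python
# Hypothetical lookup table mirroring the hardcoded rules above.
# Keys are the problematic substrings, values are their replacements.
TAG_REPLACEMENTS = {
    "%20": "-",        # space
    "%28": "",         # opening parenthesis
    "%29": "",         # closing parenthesis
    "%2C": "",         # comma
    "%27": "",         # apostrophe (assumed intent of the duplicated %28 rule)
    "%E2%80%93": "-",  # en dash, U+2013
    "&#039;": "",      # apostrophe as an HTML entity
}


def clean_tag(tag, replacements=TAG_REPLACEMENTS):
    """Apply every replacement in insertion order and return the result."""
    for bad, good in replacements.items():
        tag = tag.replace(bad, good)
    return tag


print(clean_tag("raw%20data%20%282023%29%2C%20v1"))
print(clean_tag("l&#039;eau%E2%80%93mer"))
```

With a table like this, supporting a new bad encoding means adding one entry rather than another `if` block. The standard library functions `urllib.parse.unquote` and `html.unescape` can also decode percent-encoding and HTML entities generically, which may be preferable when the goal is to recover the original characters rather than strip them.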

Sylvain