unofficial-memsource / memsource-cli-client

This is an unofficial memsource-cli-client project.
http://www.memsource-cli.org
Apache License 2.0

'charmap' error on bilingual download #15

Closed crashracer closed 4 years ago

crashracer commented 4 years ago

Running the bilingual download command generates a 'charmap' error:

$ memsource job download --type bilingual --project-id v013uZqRPSF1aFNPpWilvx --job-id VRLp9v1BAVgZgaUngvoHW1 --bilingual-format XLIFF
'charmap' codec can't encode characters in position 775-778: character maps to

Using Python 3.6.

The environment has PYTHONIOENCODING=UTF8 set, and the console code page was switched with chcp 65001.
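
A note on these two settings: as far as I know, PYTHONIOENCODING only changes the encoding of stdin, stdout and stderr, and chcp only changes the console code page. Neither changes the default that open() uses for text files, which comes from the locale. A minimal sketch for checking both values on the affected machine (the helper name describe_encodings is made up for illustration):

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python uses for the console and for text files."""
    return {
        # Influenced by PYTHONIOENCODING.
        "stdout": sys.stdout.encoding,
        # Default for open() in text mode when no encoding is passed;
        # PYTHONIOENCODING does not change this value.
        "open_default": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "=", value)
```

If open_default reports cp1252 while stdout reports UTF-8, that would be consistent with the error coming from a file write rather than from console output.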

zerodayz commented 4 years ago

Hi @crashracer,

could you provide the full traceback, with --debug added to your job download command?

The part responsible for downloading the files is: https://github.com/unofficial-memsource/memsource-cli-client/blob/d0c1c4741c92743c296b8238df2714266ba28000/memsource_cli/api_client.py#L531-L540

I suspect the encoding. Could you try adding encoding='utf-8' to the open() calls, like this?

        if content_disposition:
            filename = re.search(r'filename\*=[\w\-]+[\']+([^\'"\s]+);',
                                 content_disposition).group(1)
            path = os.path.join(os.path.dirname(path), filename)
        try:
            with open(path, "wb", encoding="utf-8") as f:
                f.write(response.data)
        except:
            with open(path, "w", encoding="utf-8") as f:
                f.write(response.data)

This is likely not the final solution, since we should avoid hardcoding the encoding in the code, but it would at least let us verify that the encoding is the cause.
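
One thing to check in the snippet above: as far as I know, open() rejects an encoding argument in binary mode, so the first branch would raise ValueError and the bare except would fall through to the text-mode write. A small sketch, independent of memsource-cli, to confirm that behaviour (the function name is hypothetical):

```python
import os
import tempfile


def binary_open_accepts_encoding(path):
    """Return True if open() accepts encoding= together with mode 'wb'."""
    try:
        with open(path, "wb", encoding="utf-8"):
            pass
    except ValueError:
        # Binary mode does not take an encoding argument.
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        probe = os.path.join(tmp, "probe.bin")
        print("wb + encoding accepted:", binary_open_accepts_encoding(probe))
```

If this prints False, only the second open() call in the snippet ever writes the file, and that call is the one carrying the UTF-8 encoding.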

crashracer commented 4 years ago

traceback `'charmap' codec can't encode characters in position 775-778: character maps to
Traceback (most recent call last):
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 537, in __deserialize_file
    f.write(response.data)
TypeError: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python3.6sci\Lib\site-packages\cliff\app.py", line 401, in run_subcommand
    result = cmd.run(parsed_args)
  File "C:\Python3.6sci\Lib\site-packages\cliff\display.py", line 116, in run
    column_names, data = self.take_action(parsed_args)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\job\v1\job.py", line 272, in take_action
    format=parsed_args.bilingual_format)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api\job_api.py", line 1159, in get_bilingual_file
    (data) = self.get_bilingual_file_with_http_info(project_uid, **kwargs) # noqa: E501
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api\job_api.py", line 1245, in get_bilingual_file_with_http_info
    collection_formats=collection_formats)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 330, in call_api
    _preload_content, _request_timeout)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 169, in call_api
    return_data = self.deserialize(response_data, response_type)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 233, in deserialize
    return self.deserialize_file(response)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 540, in __deserialize_file
    f.write(response.data)
  File "C:\Python3.6sci\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 775-778: character maps to
Traceback (most recent call last):
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 537, in __deserialize_file
    f.write(response.data)
TypeError: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python3.6sci\Scripts\memsource-script.py", line 11, in
    load_entry_point('memsource-cli==0.2.10', 'console_scripts', 'memsource')()
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\memsource.py", line 96, in main
    return memsource.run(argv)
  File "C:\Python3.6sci\Lib\site-packages\cliff\app.py", line 281, in run
    result = self.run_subcommand(remainder)
  File "C:\Python3.6sci\Lib\site-packages\cliff\app.py", line 401, in run_subcommand
    result = cmd.run(parsed_args)
  File "C:\Python3.6sci\Lib\site-packages\cliff\display.py", line 116, in run
    column_names, data = self.take_action(parsed_args)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\job\v1\job.py", line 272, in take_action
    format=parsed_args.bilingual_format)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api\job_api.py", line 1159, in get_bilingual_file
    (data) = self.get_bilingual_file_with_http_info(project_uid, **kwargs) # noqa: E501
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api\job_api.py", line 1245, in get_bilingual_file_with_http_info
    collection_formats=collection_formats)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 330, in call_api
    _preload_content, _request_timeout)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 169, in call_api
    return_data = self.deserialize(response_data, response_type)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 233, in deserialize
    return self.deserialize_file(response)
  File "C:\Python3.6sci\Lib\site-packages\memsource_cli\api_client.py", line 540, in __deserialize_file
    f.write(response.data)
  File "C:\Python3.6sci\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 775-778: character maps to `

zerodayz commented 4 years ago

Hi @crashracer

Thank you for the traceback!

Here is the relevant error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 775-778: character maps to `

I wonder which character breaks it. If it is `, I tried reproducing the error with it but couldn't.

It may well be specific to Windows, since the failing encode step goes through C:\Python3.6sci\Lib\encodings\cp1252.py.
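
To illustrate why the cp1252 codec matters: on Windows, open() in text mode without an explicit encoding typically falls back to the ANSI code page, and any character outside that code page fails to encode. A minimal sketch, using a made-up sample string rather than the actual job content:

```python
# Hypothetical sample text containing Japanese characters.
sample = "source \u65e5\u672c\u8a9e target"


def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


cp1252_result = try_encode(sample, "cp1252")
utf8_result = try_encode(sample, "utf-8")

if __name__ == "__main__":
    print("cp1252:", repr(cp1252_result))
    print("utf-8 :", len(utf8_result), "bytes")
```

The cp1252 attempt should come back as a 'charmap' UnicodeEncodeError like the one in the traceback, while the UTF-8 attempt succeeds.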

Could you change lines 535-540 of memsource-cli-client/memsource_cli/api_client.py from

        try:
            with open(path, "wb") as f:
                f.write(response.data)
        except:
            with open(path, "w") as f:
                f.write(response.data)

To

        try:
            with open(path, "wb", encoding="utf-8") as f:
                f.write(response.data)
        except:
            with open(path, "w", encoding="utf-8") as f:
                f.write(response.data)

Here is my attempt at a reproducer:

$ cat testfile.md 
Foo`Bar
$ memsource project list
+------------------------+-------------+----------+---------------------+--------------------------+--------+------------+---------------------------+-------------+--------------+------------+-----------+
| uid                    | internal_id | id       | name                | date_created             | domain | sub_domain | owner                     | source_lang | target_langs | references | user_role |
+------------------------+-------------+----------+---------------------+--------------------------+--------+------------+---------------------------+-------------+--------------+------------+-----------+
| xjwi5RW1EEgKja5HGsZdD0 |         255 | 15274680 | Memsource Project 1 | 2019-11-06               | None   | None       | {"first_name": "Robin",   | en          | ['ja']       | []         | ADMIN     |
|                        |             |          |                     | 11:17:03+00:00           |        |            | "last_name": "Cernin",    |             |              |            |           |
|                        |             |          |                     |                          |        |            | "user_name":              |             |              |            |           |
|                        |             |          |                     |                          |        |            | "robincernin", "email": " |             |              |            |           |
|                        |             |          |                     |                          |        |            | r9n.developer@gmail.com", |             |              |            |           |
|                        |             |          |                     |                          |        |            | "role": "ADMIN", "id":    |             |              |            |           |
|                        |             |          |                     |                          |        |            | "380294", "uid":          |             |              |            |           |
|                        |             |          |                     |                          |        |            | "i0joEXVYjvh6821clw6Qm5"} |             |              |            |           |
+------------------------+-------------+----------+---------------------+--------------------------+--------+------------+---------------------------+-------------+--------------+------------+-----------+
$ memsource job create --file testfile.md --project-id xjwi5RW1EEgKja5HGsZdD0 --target-langs ja
+------------------------+--------+--------------------------+-------------+--------------+
| id                     | status | date_created             | filename    | target_langs |
+------------------------+--------+--------------------------+-------------+--------------+
| 2Q5to3F8DNpQ12rdmQbGD5 | NEW    | 2019-11-27T01:22:32+0000 | testfile.md | ja           |
+------------------------+--------+--------------------------+-------------+--------------+
$ memsource job download --type bilingual --project-id xjwi5RW1EEgKja5HGsZdD0 --job-id 2Q5to3F8DNpQ12rdmQbGD5 --bilingual-format XLIFF
+--------+-----------------------------------------------------------------+
| Field  | Value                                                           |
+--------+-----------------------------------------------------------------+
| type   | bilingual                                                       |
| format | XLIFF                                                           |
| path   | /var/home/rcernin/git/memsource-cli-client/testfile-en-ja-T.xlf |
+--------+-----------------------------------------------------------------+
$ cat /var/home/rcernin/git/memsource-cli-client/testfile-en-ja-T.xlf
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" xmlns:slr="urn:oasis:names:tc:xliff:sizerestriction:2.0" xmlns:memsource="http://www.memsource.com/xliff2.0/1.0" version="2.0" memsource:wfLevel="1" srcLang="en" trgLang="ja">
<file id="O9qvSAkcSBjSfTtn_dc4:0-0" memsource:taskId="O9qvSAkcSBjSfTtn_dc4" canResegment="no" original="testfile.md">
<slr:profiles generalProfile="xliff:codepoints"/>
<unit id="0" memsource:tGroupBegin="0" memsource:tGroupEnd="0">
<segment id="0" state="initial">
<source>Foo`Bar</source>
<target></target>
</segment>
</unit>
</file>
</xliff>

zerodayz commented 4 years ago

This traceback is raised from the Windows Python3.6sci library:

File "C:\Python3.6sci\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 775-778: character maps to `

@ludekjanda We need to figure out a way to make this work on Windows.

@crashracer I don't think you would have the same problem on Linux or macOS. Could you try the same reproducer that I posted in https://github.com/unofficial-memsource/memsource-cli-client/issues/15#issuecomment-558887873 ? I could also try to reproduce this on Windows myself, but I would need to know where you got Python3.6sci from and how it was installed.

zerodayz commented 4 years ago

@crashracer

Couldn't reproduce with Windows 10 and Python 3.6.0 from https://www.python.org/downloads/release/python-360/

I wonder whether the issue lies in the Python build itself or in the way memsource-cli was installed.

To install on Windows, I ran:

python -m pip install memsource-cli

Then I changed directory to C:\Users\admin\AppData\Local\Programs\Python\Python36\Scripts and ran the following:

C:\Users\admin\AppData\Local\Programs\Python\Python36\Scripts>memsource.exe job list --project-id xjwi5RW1EEgKja5HGsZdD0 -f value -c uid
OwMWJLd5Pm9V7BdVYgqTf1
2Q5to3F8DNpQ12rdmQbGD5

C:\Users\admin\AppData\Local\Programs\Python\Python36\Scripts>memsource.exe job download --type bilingual --project-id xjwi5RW1EEgKja5HGsZdD0 --job-id 2Q5to3F8DNpQ12rdmQbGD5 --bilingual-format XLIFF
+--------+-----------------------------------------------------------------+
| Field  | Value                                                           |
+--------+-----------------------------------------------------------------+
| type   | bilingual                                                       |
| format | XLIFF                                                           |
| path   | C:\Users\admin\AppData\Local\Programs\Python\Python36\Scripts\t |
|        | estfile-en-ja-T.xlf                                             |
+--------+-----------------------------------------------------------------+

<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" xmlns:slr="urn:oasis:names:tc:xliff:sizerestriction:2.0" xmlns:memsource="http://www.memsource.com/xliff2.0/1.0" version="2.0" memsource:wfLevel="1" srcLang="en" trgLang="ja">
<file id="O9qvSAkcSBjSfTtn_dc4:0-0" memsource:taskId="O9qvSAkcSBjSfTtn_dc4" canResegment="no" original="testfile.md">
<slr:profiles generalProfile="xliff:codepoints"/>
<unit id="0" memsource:tGroupBegin="0" memsource:tGroupEnd="0">
<segment id="0" state="final">
<source>Foo`Bar</source>
<target>Foo`Bar</target>
</segment>
</unit>
</file>
</xliff>

I think there is a character in the file that is breaking it. My guess was `, but I was unable to reproduce the error with that character, even on Windows. However, I am using different Python libraries, which may also be a factor.

To find out what is going on, I will need your help.

zerodayz commented 4 years ago

I still can't find the offending character. I have just run a test file containing the whole Windows-1252 character set from https://en.wikipedia.org/wiki/Windows-1252

It worked on both Fedora 31 with Python 3.7.4 and Windows 10 with Python 3.6.0.
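
A possible explanation for why that test passed: every character in the Windows-1252 set is, by definition, encodable by the cp1252 codec, so such a file cannot trigger the error. Only a character outside the set can. A rough sketch of that check (variable names are illustrative):

```python
# Collect every character that Python's cp1252 codec decodes from a single byte.
cp1252_chars = []
for byte in range(256):
    try:
        cp1252_chars.append(bytes([byte]).decode("cp1252"))
    except UnicodeDecodeError:
        # A few byte values are unassigned in Python's cp1252 table.
        continue

# Each collected character encodes back to one byte, so none of them
# can raise UnicodeEncodeError when written through cp1252.
all_encodable = all(len(ch.encode("cp1252")) == 1 for ch in cp1252_chars)

# A character outside the set (a Japanese one, as an assumed example) fails.
try:
    "\u65e5".encode("cp1252")
    outside_encodable = True
except UnicodeEncodeError:
    outside_encodable = False

if __name__ == "__main__":
    print(len(cp1252_chars), "characters, all encodable:", all_encodable)
    print("character outside the set encodable:", outside_encodable)
```

A reproducer would therefore need source or target text containing characters outside Windows-1252, which a bilingual file with a non-Western language is likely to have.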

crashracer commented 4 years ago

This fixed it:

        try:
            with open(path, "wb", encoding="utf-8") as f:
                f.write(response.data)
        except:
            with open(path, "w", encoding="utf-8") as f:
                f.write(response.data)

Thank you very much!

zerodayz commented 4 years ago

@crashracer This will be included in version 0.3.1, which should be released in about 10 days. Until then, the fix is available only on GitHub: https://github.com/unofficial-memsource/memsource-cli-client/commit/449c79bc612363a14237ef5cfdd6f0c7587c4602
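
For anyone applying the same change by hand before the release, here is a sketch of a variant that picks the file mode from the payload type instead of relying on a bare except. This is not the code from the linked commit; write_payload and its arguments are hypothetical names:

```python
import os
import tempfile


def write_payload(path, data):
    """Write bytes as-is, and str as UTF-8 text regardless of the OS locale."""
    if isinstance(data, bytes):
        with open(path, "wb") as f:
            f.write(data)
    else:
        # newline="" writes line endings exactly as received.
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(data)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up XLIFF fragment with Japanese text.
        target = write_payload(os.path.join(tmp, "demo.xlf"),
                               "<target>\u65e5\u672c\u8a9e</target>\n")
        print(os.path.getsize(target), "bytes written")
```

Checking the type up front keeps unrelated errors, such as a missing directory, from being silently retried in text mode.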