Hi,
I downloaded the model from the link mentioned in the issue, https://rusvectores.org/static/models/udpipe_syntagrus.model, and slightly updated your script to avoid the processing module:
# coding: utf-8
from ufal.udpipe import Model, Pipeline
model = Model.load("udpipe_syntagrus.model")
process_pipeline = Pipeline(
model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu"
)
out = process_pipeline.process("чудо")
print(out, end="")
Then, on Debian Stable with stock Python 3.9.2, I installed ufal.udpipe and the above script produced
# newdoc
# newpar
# sent_id = 1
# text = чудо
1 чудо чудо NOUN _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing 0 root _ SpacesAfter=\n
I also tried the docker image python:3.9, downloaded the above model using curl, installed ufal.udpipe, and again got
# newdoc
# newpar
# sent_id = 1
# text = чудо
1 чудо чудо NOUN _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing 0 root _ SpacesAfter=\n
So currently I am unable to replicate. Could you please send the docker image and the exact sequence of commands you used to obtain the above result? Thanks.
Hello, @foxik. Thank you very much for the fast reply. Here's everything I'm using when facing the issue
Files structure:
app
- Dockerfile
- test.py
- udpipe_syntagrus.model
Dockerfile
FROM python:3.9
COPY . .
RUN pip install ufal.udpipe
CMD ["bash"]
test.py
# coding: utf-8
from ufal.udpipe import Model, Pipeline
model = Model.load("udpipe_syntagrus.model")
process_pipeline = Pipeline(
model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu"
)
out = process_pipeline.process("чудо")
print(out, end="")
model sha256sum:
$ sha256sum udpipe_syntagrus.model
25bdf7e8c25d41c1e33bff1b8394ad9fe237e6cf8dc23ea83c582753e63ead21 udpipe_syntagrus.model
What I run:
$ docker build -t "test-udpipe" .
$ docker run -it test-udpipe
// inside the container
$ python test.py
# newdoc
# newpar
# sent_id = 1
# text = чудо
1 чудо чудо ADP _ _ 0 root _ SpacesAfter=\n
The host machine is a MacBook Pro with an M1 Pro running macOS Ventura 13.0.1. I'm also using Docker Desktop for Mac.
$ docker -v
Docker version 20.10.22, build 3a2c30b
Hi,
thanks for the information! When I follow the same procedure on docker on Linux, specifically
$ docker --version
Docker version 19.03.6, build 369ce74a3c
$ uname -a
Linux udpipe 5.4.44-2-pve #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) x86_64 x86_64 x86_64 GNU/Linux
then the result is "correct", i.e., it is
# newdoc
# newpar
# sent_id = 1
# text = чудо
1 чудо чудо NOUN _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing 0 root _ SpacesAfter=\n
So I guess this is somehow connected to the Docker running on macOS (I am not saying the error is not in UDPipe -- we could be using some C++ command with undefined behavior, which behaves "expectedly" on regular Win/Linux/macOS, but differently in the macOS docker). Unfortunately, I do not have access to any macOS with Docker (Macs are not very widespread here), so I do not know how to debug further.
@yuoppp can you double-check you are using the same (correct) model file in the docker? You can also parse more Russian sentences to see if the result is at least reasonable or completely wrong, which could help debugging the problem.
Hi, @martinpopel. Yes, I'm using the same model. I checked by comparing sha256 sums locally and inside the container.
Right now I'm trying to run the image on a cloud Linux VM, so at least I'll be able to use the right version in the prod env.
It seems like the problem affects only Gender=Neut words. I can't tell for sure, I just randomly checked some words.
Also, I can help you debug if needed. Just tell me what to change in the sources and how to run it, because I know nothing about C++.
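For anyone repeating the checksum comparison mentioned above from Python rather than with the sha256sum command, a minimal sketch follows. The helper name file_sha256 is hypothetical; the expected digest is the one posted earlier in this thread.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large model files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest of udpipe_syntagrus.model as reported in this thread.
EXPECTED = "25bdf7e8c25d41c1e33bff1b8394ad9fe237e6cf8dc23ea83c582753e63ead21"

if __name__ == "__main__":
    actual = file_sha256("udpipe_syntagrus.model")
    print("match" if actual == EXPECTED else "MISMATCH: " + actual)
```

Running the same script on the host and inside the container should print the same result if the model file was copied intact.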
Well, it seems like the macOS docker definitely breaks something.
I built an image on macOS, but for linux/amd64 instead of the default linux/arm64/v8.
When running the image on a cloud VM, python test.py prints the correct value:
root@udpipe-test:/# python test.py
# newdoc
# newpar
# sent_id = 1
# text = чудо
1 чудо чудо NOUN _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing 0 root _ SpacesAfter=\n
Thanks for verifying!
I tried running both the amd64 and arm64 UDPipe binary under macOS (13.1) and they both work fine. I am now getting access to a macOS with docker (I should get it in a day or two) and I will continue debugging then.
> Also, I can help you to debug, if needed. Just tell me where to write what in sources and how to run it, cuz I know nothing in c++
It is not obvious how to debug the problem, it will probably take quite some time, so I do not think I could do it noninteractively.
Today I got access to a macOS machine with Docker, and the good thing is that I can replicate the issue.
The problem is that, by default, Docker on macOS uses the linux/arm64 (aarch64) platform -- however, UDPipe was never tested on this platform, and there is obviously some undefined behavior happening. You can also see that during the build, ufal.udpipe is actually compiled (because we do not provide wheels for Linux ARM).
When you use linux/amd64, for example by passing --platform linux/amd64 when building the Docker image, UDPipe works fine (I was able to run this on macOS directly, I assume Rosetta is used for that?).
A good thing is that now I know the problem is in the Linux aarch64 platform -- I was able to replicate the issue without Docker, directly on Linux using qemu to run the aarch64 binaries. The Linux aarch64 therefore needs to be added to the supported platforms, which will happen, but might take some time (because it is difficult to estimate the time needed).
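A quick way to see which platform a given container or machine is actually running is to ask Python itself. The following is a minimal sketch; the helper name is_arm64 is hypothetical and only compares the architecture string against the two common spellings of 64-bit ARM.

```python
import platform
import sys


def is_arm64(machine):
    """Return True if the architecture string names 64-bit ARM."""
    # Linux reports "aarch64", while macOS reports "arm64".
    return machine.lower() in {"aarch64", "arm64"}


if __name__ == "__main__":
    machine = platform.machine()
    print("os:", sys.platform, "arch:", machine)
    if sys.platform.startswith("linux") and is_arm64(machine):
        print("Linux aarch64: the platform discussed in this issue")
```

Inside an image built with --platform linux/amd64 the reported architecture should be x86_64, even on an Apple Silicon host.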
@foxik-admin Thank you so much!
Can confirm that docker build -t "udpipe-test" --platform="linux/amd64" . works fine with macOS docker.
I have hopefully found the cause of the problem -- on the aarch64 architecture, char is unsigned by default, while it is signed on all architectures we support. There was just a single place where we relied on that -- the computation of the FNV hash. I added an explicit cast to (signed char) there, and after rebuilding I got the same results on your input, and also on other inputs where the differences were very large before the fix.
The version 1.3.0, which will be released later today, will contain the fix.
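To illustrate why the signedness of char can change an FNV hash, here is a minimal Python sketch. It assumes the standard 32-bit FNV-1a constants and a hypothetical function name fnv1a_32; UDPipe's actual C++ hash code is not shown in this thread and may use a different variant or width. The sketch emulates what happens in C++ when a char is widened before being XORed into the hash: a signed char with the high bit set is sign-extended, an unsigned one is not.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime
MASK32 = 0xFFFFFFFF


def fnv1a_32(data, char_is_signed):
    """32-bit FNV-1a over bytes, emulating signed or unsigned char widening."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        if char_is_signed and byte >= 0x80:
            # A signed char holds -128..-1 here; widening sign-extends it.
            byte -= 0x100
        # "& MASK32" turns a negative value into 0xFFFFFFxx, as in C++.
        h = ((h ^ (byte & MASK32)) * FNV_PRIME) & MASK32
    return h


if __name__ == "__main__":
    for text in ("abc", "чудо"):
        raw = text.encode("utf-8")
        signed = fnv1a_32(raw, char_is_signed=True)
        unsigned = fnv1a_32(raw, char_is_signed=False)
        print(text, hex(signed), hex(unsigned), signed == unsigned)
```

ASCII bytes are all below 0x80, so they hash identically either way. Every byte of UTF-8 encoded Cyrillic text is 0x80 or above, so Russian word forms are exactly the inputs where the two conventions diverge, which would make lookups keyed by such a hash miss on a platform with the other char signedness.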
macOS: Using venv with Python 3.9.6, I did pip install ufal.udpipe and the code prints
The model loads like this:
which is correct. I know you're not familiar with my language, but "чудо" means "miracle", which is a noun.
Linux: When I run exactly the same code inside a Docker container with Linux, the code prints
which is wrong; the word чудо does not exist as an adposition. I tried every official python3.9 image from https://hub.docker.com/_/python and also tried to run it using ubuntu:22.04; nothing changes.
I hope this issue is related to https://github.com/ufal/udpipe/issues/51 and will be easy to resolve. I created a new one, because https://github.com/ufal/udpipe/issues/51 was about udpipe v1 and already closed.
UPD. I'm using this model: https://rusvectores.org/static/models/udpipe_syntagrus.model