ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Linux aarch64 platform silently gives different results. #166

Closed yuoppp closed 1 year ago

yuoppp commented 1 year ago

macOS: Using a venv with Python 3.9.6, I ran pip install ufal.udpipe, and the code

# coding: utf-8

from ufal.udpipe import Pipeline
from processing.model import Models

models = Models()
model = models.udpipe_model
process_pipeline = Pipeline(
    model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu"
)

out = process_pipeline.process("чудо")
print(out)

prints

# newdoc
# newpar
# sent_id = 1
# text = чудо
1       чудо    чудо    NOUN    _       Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing   0       root    _       SpacesAfter=\n

The model loads like this:

from ufal.udpipe import Model
...
self.udpipe_model = Model.load(self.udpipe_model_path)
...

The printed output is correct. I know you're not familiar with my language, but "чудо" means "miracle", which is a noun.

Linux: When I run exactly the same code inside a Docker container running Linux, it prints

# newdoc
# newpar
# sent_id = 1
# text = чудо
1       чудо    чудо    ADP     _       _       0       root    _       SpacesAfter=\n

which is wrong: the word чудо does not exist as an adposition.

I tried every official Python 3.9 image from https://hub.docker.com/_/python and also tried running it on ubuntu:22.04; nothing changes.

I hope this issue is related to https://github.com/ufal/udpipe/issues/51 and will be easy to resolve. I created a new one, because https://github.com/ufal/udpipe/issues/51 was about udpipe v1 and already closed.

UPD. I'm using this model: https://rusvectores.org/static/models/udpipe_syntagrus.model

foxik commented 1 year ago

Hi,

I downloaded the model from the link mentioned in the issue https://rusvectores.org/static/models/udpipe_syntagrus.model, and slightly updated your script to avoid the processing module:

# coding: utf-8
from ufal.udpipe import Model, Pipeline

model = Model.load("udpipe_syntagrus.model")
process_pipeline = Pipeline(
    model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu"
)

out = process_pipeline.process("чудо")
print(out, end="")

Then on Debian Stable with stock Python 3.9.2 I installed ufal.udpipe and the above script produced

# newdoc
# newpar
# sent_id = 1
# text = чудо
1   чудо    чудо    NOUN    _   Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing   0   root    _   SpacesAfter=\n

I also tried docker image python:3.9, I downloaded the above model using curl, I installed ufal.udpipe, and I again got

# newdoc
# newpar
# sent_id = 1
# text = чудо
1   чудо    чудо    NOUN    _   Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing   0   root    _   SpacesAfter=\n

So currently I am unable to replicate. Could you please send the docker image and the exact sequence of commands you used to obtain the above result? Thanks.

yuoppp commented 1 year ago

Hello, @foxik. Thank you very much for the fast reply. Here's everything I'm using when I hit the issue:

Files structure:

app
    - Dockerfile
    - test.py
    - udpipe_syntagrus.model

Dockerfile

FROM python:3.9
COPY . .
RUN pip install ufal.udpipe
CMD ["bash"]

test.py

# coding: utf-8
from ufal.udpipe import Model, Pipeline

model = Model.load("udpipe_syntagrus.model")
process_pipeline = Pipeline(
    model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu"
)

out = process_pipeline.process("чудо")
print(out, end="")

model sha256sum:

$ sha256sum udpipe_syntagrus.model 
25bdf7e8c25d41c1e33bff1b8394ad9fe237e6cf8dc23ea83c582753e63ead21  udpipe_syntagrus.model

What I run:

$ docker build -t "test-udpipe" .
$ docker run -it test-udpipe

// inside the container
$ python test.py
# newdoc
# newpar
# sent_id = 1
# text = чудо
1       чудо    чудо    ADP     _       _       0       root    _       SpacesAfter=\n

The host machine is a MacBook Pro with an M1 Pro chip running macOS Ventura 13.0.1, with Docker Desktop for Mac.

$ docker -v
Docker version 20.10.22, build 3a2c30b

foxik commented 1 year ago

Hi,

thanks for the information! When I follow the same procedure on docker on Linux, specifically

$ docker --version
Docker version 19.03.6, build 369ce74a3c
$ uname -a
Linux udpipe 5.4.44-2-pve #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) x86_64 x86_64 x86_64 GNU/Linux

then the result is "correct", i.e., it is

# newdoc
# newpar
# sent_id = 1
# text = чудо
1   чудо    чудо    NOUN    _   Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing   0   root    _   SpacesAfter=\n

So I guess this is somehow connected to the Docker running on macOS (I am not saying the error is not in UDPipe -- we could be using some C++ command with undefined behavior, which behaves "expectedly" on regular Win/Linux/macOS, but differently in the macOS docker). Unfortunately, I do not have access to any macOS with Docker (Macs are not very widespread here), so I do not know how to debug further.

martinpopel commented 1 year ago

@yuoppp can you double-check you are using the same (correct) model file in the docker? You can also parse more Russian sentences to see if the result is at least reasonable or completely wrong, which could help debugging the problem.

yuoppp commented 1 year ago

Hi, @martinpopel. Yes, I'm using the same model. I checked by comparing the SHA-256 sums locally and inside the container.
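The checksum comparison mentioned above can also be scripted, so the same check runs on the host and inside the container. This is a minimal sketch: the helper names sha256_of_file and model_matches are hypothetical, and the expected digest is the one posted earlier in this thread.

```python
import hashlib

# Digest posted earlier in this thread for udpipe_syntagrus.model.
EXPECTED_SHA256 = "25bdf7e8c25d41c1e33bff1b8394ad9fe237e6cf8dc23ea83c582753e63ead21"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large model file is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's digest equals `expected` (case-insensitive hex)."""
    return sha256_of_file(path) == expected.lower()
```

Running model_matches("udpipe_syntagrus.model") in both environments rules out a corrupted or different model file before looking at the library itself.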

Right now I'm trying to run the image on a cloud Linux VM, so at least I'll be able to use the right version in the prod environment.

yuoppp commented 1 year ago

Seems like the problem affects only Gender=Neut words. I can't tell for sure, I just randomly checked some words.
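Rather than spot-checking words by eye, the CoNLL-U output from a known-good platform can be compared with the output from the suspect one. The sketch below is only an illustration: parse_conllu_tokens and diff_tags are hypothetical helpers, and they assume real tab-separated CoNLL-U (as returned by Pipeline.process) with identical tokenization on both sides.

```python
def parse_conllu_tokens(conllu: str):
    """Yield (form, upos, feats) for each token line of a CoNLL-U string."""
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments such as "# sent_id = 1"
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed 10-column token line
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        yield cols[1], cols[3], cols[5]


def diff_tags(reference: str, candidate: str):
    """Return tokens whose UPOS or FEATS differ between two outputs.

    Tokens are paired by position, so both outputs must be tokenized alike.
    """
    diffs = []
    pairs = zip(parse_conllu_tokens(reference), parse_conllu_tokens(candidate))
    for ref, cand in pairs:
        if ref != cand:
            diffs.append(
                {"form": ref[0], "reference": ref[1:], "candidate": cand[1:]}
            )
    return diffs
```

Feeding it the outputs for a longer word list from both environments would show whether the differences really cluster on Gender=Neut words or are spread more widely.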

Also, I can help you to debug, if needed. Just tell me where to write what in sources and how to run it, cuz I know nothing in c++

yuoppp commented 1 year ago

Well, it seems like Docker on macOS definitely breaks something. I built an image on macOS, but for linux/amd64 instead of the default linux/arm64/v8. When I run the image on a cloud VM, python test.py prints the correct value:

root@udpipe-test:/# python test.py
# newdoc
# newpar
# sent_id = 1
# text = чудо
1   чудо    чудо    NOUN    _   Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing   0   root    _   SpacesAfter=\n

foxik commented 1 year ago

Thanks for verifying!

I tried running both the amd64 and arm64 UDPipe binary under macOS (13.1) and they both work fine. I am now getting access to a macOS with docker (I should get it in a day or two) and I will continue debugging then.

> Also, I can help you to debug, if needed. Just tell me where to write what in sources and how to run it, cuz I know nothing in c++

It is not obvious how to debug the problem, and it will probably take quite some time, so I do not think I could do it noninteractively.

foxik commented 1 year ago

Today I got access to a macOS machine with Docker, and the good thing is that I can replicate the issue.

The problem is that by default, Docker on macOS (on Apple Silicon machines such as yours) uses the linux/arm64 (aarch64) platform -- however, UDPipe was never tested on this platform, and there is obviously some undefined behavior happening. You can also see that during the build, ufal.udpipe is actually compiled from source (because we do not provide wheels for Linux ARM).

When you use linux/amd64, for example by passing --platform linux/amd64 when building the Docker image, UDPipe works fine (I was able to run this on macOS directly; I assume Rosetta is used for that?).

A good thing is that now I know the problem is in the Linux aarch64 platform -- I was able to replicate the issue without Docker, directly on Linux using qemu to run the aarch64 binaries. The Linux aarch64 therefore needs to be added to the supported platforms, which will happen, but might take some time (because it is difficult to estimate the time needed).
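To see which platform a given container is actually running, the architecture can be checked from Python itself. This is a minimal sketch using the standard platform module; is_linux_aarch64 and describe_platform are hypothetical helper names, not part of ufal.udpipe.

```python
import platform


def is_linux_aarch64(system: str, machine: str) -> bool:
    """True for Linux on 64-bit ARM, the platform affected in this thread."""
    return system == "Linux" and machine.lower() in {"aarch64", "arm64"}


def describe_platform() -> str:
    """Describe the running platform, flagging Linux aarch64."""
    system, machine = platform.system(), platform.machine()
    note = ""
    if is_linux_aarch64(system, machine):
        note = " (Linux aarch64: the platform affected in this thread)"
    return f"{system}/{machine}{note}"


if __name__ == "__main__":
    print(describe_platform())
```

Inside an image built with --platform linux/amd64, platform.machine() should report x86_64 even on an ARM host, because the userland runs under emulation.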

yuoppp commented 1 year ago

@foxik-admin Thank you so much! Can confirm that docker build -t "udpipe-test" --platform="linux/amd64" . works fine with macOS docker.

foxik commented 1 year ago

I have hopefully found the cause of the problem -- on the aarch64 architecture, char is unsigned by default, while it is signed on all architectures we support. There was just a single place where we relied on char being signed: the computation of the FNV hash. I added an explicit cast to (signed char) there, and after rebuilding I got the same results on your input, and also on other inputs where the differences were very large before the fix.

The version 1.3.0, which will be released later today, will contain the fix.
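The effect of char signedness on a byte-wise hash can be reproduced outside C++. The sketch below assumes a standard 32-bit FNV-1a hash; the thread does not show which FNV variant UDPipe uses or how each char is mixed in, so fnv1a_32 is an illustration of the mechanism and not UDPipe's actual code.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime
MASK32 = 0xFFFFFFFF


def fnv1a_32(data: bytes, char_is_signed: bool) -> int:
    """32-bit FNV-1a over `data`, emulating a signed or unsigned C `char`."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        if char_is_signed and byte >= 0x80:
            byte -= 0x100  # a signed char holds -128..-1 for bytes 0x80..0xFF
        # Widening a negative char to 32 bits sign-extends it (0xD1 becomes
        # 0xFFFFFFD1), so the XOR also flips the upper 24 bits of the hash.
        h ^= byte & MASK32
        h = (h * FNV_PRIME) & MASK32
    return h


if __name__ == "__main__":
    for word in ("abc", "чудо"):
        raw = word.encode("utf-8")
        print(word, hex(fnv1a_32(raw, True)), hex(fnv1a_32(raw, False)))
```

Pure ASCII input hashes identically under both settings, because every byte is below 0x80. UTF-8 encoded Cyrillic consists entirely of bytes at or above 0x80, so every step of the hash is affected, which is consistent with the large differences reported on Russian text.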