huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

I would like to get help from Datatrove enthusiasts regarding issues I'm facing while running the example script. #235

Open barneylogo opened 4 months ago

barneylogo commented 4 months ago

Hello, Datatrove enthusiasts,

Nice to meet you all.

Recently, I've been working on the Datatrove library and I'm trying to run a sample script, process_common_crawl_dump.py from the following link: Datatrove GitHub.

I've made a couple of changes to the script: I've reduced the number of tasks from 8000 to 4 and updated randomize_start_duration to randomize_start. However, after running the script, I encountered some issues.

Here is the accounting history that I received: Accounting History

Additionally, I believe these logs are stored on my S3: S3 Log 1 [screenshot] [screenshot] I was expecting to get output as a result, but there are no output directories or files; I only got log files. Expected Output

For reference, here is my slurm.conf file: Slurm.conf

I've tried running the script multiple times, but I always get the same result. I'm not sure if this is the right place to ask for help, but I would appreciate any assistance from fellow Datatrove lovers.

Thank you!

hynky1999 commented 4 months ago

Hi, I can't see it from the screenshot, but what's the value of MAIN_OUTPUT_PATH? The resulting files should be saved in {MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}, not in the logs folder.

barneylogo commented 4 months ago

Hi @hynky1999, thank you for your reply. Here is MAIN_OUTPUT_PATH: [screenshot] What I mean is that after running the script, I can't see an output folder anywhere. If possible, could we use a communication channel such as Discord or Telegram? My Discord is barney49 and my Telegram is @raincoin5. I really hope to talk with you.

hynky1999 commented 4 months ago

Then can you check whether s3://data-refine/base_processing/base_processing/output/ contains any folders?

barneylogo commented 4 months ago

There is no output folder. I can only see the logs folder, as in the screenshot I shared.

hynky1999 commented 4 months ago

Strange. So if you run aws s3 ls s3://data-refine/base_processing//base_processing/output/ you get no results? (Notice the double //.)
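The two paths checked above differ only by a doubled slash. A minimal sketch of how that can arise, assuming the output location is built with an f-string of the form {MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS} as quoted earlier in this thread. The MAIN_OUTPUT_PATH values are inferred from the paths in the comments, the dump name is an assumed example, and the helper names are hypothetical:

```python
# Illustrative values only; the reporter's real settings are not shown in the thread.
DUMP_TO_PROCESS = "CC-MAIN-2023-23"  # assumed example dump name


def output_path(main_output_path: str, dump: str) -> str:
    """Compose the output location with plain string formatting."""
    return f"{main_output_path}/base_processing/output/{dump}"


def normalized_output_path(main_output_path: str, dump: str) -> str:
    """Same as output_path, but drop trailing slashes from the base first."""
    return output_path(main_output_path.rstrip("/"), dump)


# Base path without a trailing slash: single separators throughout.
print(output_path("s3://data-refine/base_processing", DUMP_TO_PROCESS))

# Base path with a trailing slash: the f-string yields "//" after the base.
print(output_path("s3://data-refine/base_processing/", DUMP_TO_PROCESS))

# Stripping the trailing slash gives the same result for both inputs.
print(normalized_output_path("s3://data-refine/base_processing/", DUMP_TO_PROCESS))
```

Because S3 object keys are literal strings, a prefix containing // is a different prefix from one with a single slash, so it is worth listing both when output seems to be missing.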

barneylogo commented 4 months ago

It causes an error!

barneylogo commented 4 months ago

[screenshot]

barneylogo commented 4 months ago

On AWS, I can only see the logs folder. [screenshot]

barneylogo commented 4 months ago

Hello @hynky1999, if you don't mind, could we discuss more details via Discord or Telegram? I really hope to solve this problem as soon as possible. Or, where can I find the community? Thank you.

barneylogo commented 4 months ago

Hello @hynky1999, if possible, could you leave a message? Anyway, thank you for your help. I really need to solve this problem.

hynky1999 commented 4 months ago

Hey, we don't have any community forum as of right now. Could you send the logs you got, please? (Not screenshots.)

barneylogo commented 4 months ago

which logs?

barneylogo commented 4 months ago

I will send all files

barneylogo commented 4 months ago

Hi @hynky1999, here are the logs: https://drive.google.com/drive/folders/1JjbxAKdsfgAaFm3H9Y_JsD-8O6MwoOmf Also, I only have the logs folder in my AWS account, but I can't download it. [screenshot]

hynky1999 commented 4 months ago

Ahh, okay, it seems like none of the files gets through extraction. Could you try increasing the timeout to 1 sec? See https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/extractors/trafilatura.py#L26

barneylogo commented 4 months ago

I will try. thank you

barneylogo commented 4 months ago

Hi @hynky1999, after I ran the script, I got new error logs. You can see them in the Google Drive folder above: 1051_0 ~ 1051_3. [screenshot]

barneylogo commented 4 months ago

Hello @hynky1999, my eventual goal is to run this script (https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py), but it doesn't give me the expected result. Could you give me some advice? I think you may have an idea.

barneylogo commented 4 months ago

hi @hynky1999

I think the server spec may be the problem. Here is my CPU spec: [screenshot] Or does the datatrove library need a high GPU spec? Or was my Slurm configuration wrong? If possible, could you let me know?

barneylogo commented 4 months ago

hello @hynky1999 how are you today?

hynky1999 commented 4 months ago

I am good, thank you for asking :) It's not a Slurm problem. How did you install datatrove? From pip or from source? Can you run the following command and send the output: pip freeze | grep numpy
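The same information as the pip freeze | grep numpy check can also be read from inside Python, which confirms that the interpreter running the pipeline sees the same packages. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution names are assumed to be numpy and datatrove:

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the interpreter and the two packages discussed in this thread.
print("python", ".".join(str(part) for part in sys.version_info[:3]))
for name in ("numpy", "datatrove"):
    print(name, installed_version(name))
```

Running this with the same interpreter that launches the Slurm tasks avoids the case where pip and the job use different environments.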

barneylogo commented 4 months ago

I installed it with pip install datatrove[all]. I thought the problem might be the Python version: I was using Python 3.12.4, so I have just reinstalled Python as 3.10.12 and am installing datatrove again. Once that is done, I can send you the output. Or could you let me know which Python version I should use? Thank you.

hynky1999 commented 4 months ago

Yeah, we haven't released on PyPI for a while, so we don't have a locked dependency for numpy. Can you try installing datatrove from source like this? pip install 'datatrove[all]'@git+https://github.com/huggingface/datatrove

barneylogo commented 4 months ago

I will try, thank you. So you mean any Python version is OK?

hynky1999 commented 4 months ago

3.10+ should be fine

barneylogo commented 4 months ago

Hello @hynky1999, I've installed datatrove like this: pip install 'datatrove[all]'@git+https://github.com/huggingface/datatrove The script now runs without error logs, but I think we are still not getting the expected result. I've uploaded the log files: https://drive.google.com/drive/folders/1JjbxAKdsfgAaFm3H9Y_JsD-8O6MwoOmf If possible, could you give me your opinion on the logs again? Thank you.

hynky1999 commented 4 months ago

Hi, could you try processing more samples, 10k+? (By setting the limit variable in the reader.)

barneylogo commented 4 months ago

Hello @hynky1999, how are you doing? I really hope to talk with you. Could we use Discord or Telegram? If you don't want that, could you let me know which channel you prefer? Thank you.

hynky1999 commented 4 months ago

Hi, I don't want to resolve this issue anywhere outside of the GitHub issues. What's the state of your problem now? I can't see any logs in the Google Drive folder you sent. PS: Could you post logs directly in this issue conversation next time?

barneylogo commented 4 months ago

How are you, @hynky1999? Could you check this status? [screenshot] Also, the first task took 10 hours, and I have 6000 tasks in total. How can I increase the speed?

hynky1999 commented 4 months ago

So now you can see the output? Regarding speed, there is not much you can do to speed it up other than using more CPUs or improving I/O.
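To put the numbers from this exchange in perspective, here is a rough back-of-the-envelope sketch of how total wall-clock time scales with the number of tasks running at once. It assumes every task takes about as long as the first one (10 hours), that workers run fully in parallel, and that I/O does not become a bottleneck; none of these is guaranteed in practice, and the function name is hypothetical:

```python
import math


def estimated_wall_clock_hours(total_tasks: int, hours_per_task: float,
                               parallel_workers: int) -> float:
    """Estimate total run time when tasks execute in waves of parallel_workers."""
    if parallel_workers < 1:
        raise ValueError("parallel_workers must be at least 1")
    # Each wave runs up to parallel_workers tasks at the same time.
    waves = math.ceil(total_tasks / parallel_workers)
    return waves * hours_per_task


# 6000 tasks at roughly 10 hours each, for a few assumed worker counts.
for workers in (1, 4, 100):
    hours = estimated_wall_clock_hours(6000, 10, workers)
    print(f"{workers} worker(s): about {hours:.0f} hours")
```

Under these assumptions the estimate shrinks roughly in proportion to the number of CPUs available, which is consistent with the advice that more CPUs or faster I/O are the main levers.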