barneylogo opened this issue 4 months ago
Hi,
I can't see it from the screenshot, but what's the value of `MAIN_OUTPUT_PATH`?
The resulting files should be saved in `{MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}`, not in the logs folder.
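For context, a minimal sketch of how the example script composes that path (the bucket and dump values below are placeholders, not your actual config):

```python
# Placeholder values; substitute your own bucket and dump name.
MAIN_OUTPUT_PATH = "s3://some-bucket"
DUMP_TO_PROCESS = "CC-MAIN-2023-50"

# Extracted documents should land here...
output_folder = f"{MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}"
# ...while the logs you are seeing are written to a separate logs folder.
print(output_folder)  # -> s3://some-bucket/base_processing/output/CC-MAIN-2023-50
```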
Hi @hynky1999,
Thank you for your reply. Here is my `MAIN_OUTPUT_PATH`:
I mean, after running the script, I can't see an output folder anywhere.
If possible, can we use a communication channel such as Discord or Telegram? My Discord is barney49 and my Telegram is @raincoin5. I really hope to hear from you.
Then can you check whether `s3://data-refine/base_processing/base_processing/output/` contains any folders?
There is no output folder. I can only see the logs folder, as in the screenshot I shared.
Strange. So if you run `aws s3 ls s3://data-refine/base_processing//base_processing/output/` you get no results? (notice the double //)
It causes an error!
On AWS, I can only see the logs folder.
Hello @hynky1999, if you don't mind, can we discuss the details via Discord or Telegram? I really hope to solve this problem ASAP. Or, where can I find the community? Thank you.
Hello @hynky1999, if possible, could you leave a message? Anyway, thank you for your help. I really need to solve this problem.
Hey, we don't have any community forum as of right now. Could you send the logs you got, please? (not screenshots)
Which logs?
I will send all the files.
Hi @hynky1999, here are the logs: https://drive.google.com/drive/folders/1JjbxAKdsfgAaFm3H9Y_JsD-8O6MwoOmf. Also, I only have the logs folder in my AWS account, but I can't download it.
Ahh, okay, it seems like none of the files gets through extraction. Could you try increasing the timeout to 1 sec? See https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/extractors/trafilatura.py#L26
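Something like this should do it (a sketch, not a copy of the example script; the import path and `favour_precision` mirror the repo's examples, and `timeout` is the parameter on the linked line, in seconds):

```python
from datatrove.pipeline.extractors import Trafilatura

# A longer per-document timeout gives slow pages more time to extract
# before they are dropped as failed extractions.
extractor = Trafilatura(favour_precision=True, timeout=1.0)
```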
I will try. Thank you.
Hi @hynky1999, after running this script I got new error logs. You can see them in the Google Drive folder above: 1051_0 ~ 1051_3.
Hello @hynky1999, actually, my end goal is to run this script: https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py. But it doesn't give me the expected result. Could you give me some advice? I think you have an idea.
Hi @hynky1999,
I think the server spec is the problem. Here is my CPU spec. Or does the datatrove library need a high GPU spec? Or was my Slurm configuration wrong? If possible, could you let me know?
Hello @hynky1999, how are you today?
I am good, thank you for asking :)
It's not a Slurm problem. How did you install datatrove? From pip or from source?
Can you run the following command and send the output: `pip freeze | grep numpy`?
`pip install datatrove[all]`
I just thought it might be because of the Python version. I was using Python 3.12.4, so I've just reinstalled Python 3.10.12 and am installing datatrove again. Once it's done, I can send you the output. Or, could you let me know which Python version I should use?
Thank you
Yeah, we haven't released on PyPI for a while, so we don't have a locked dependency for numpy.
Can you try installing datatrove like this? (from source)
`pip install 'datatrove[all]'@git+https://github.com/huggingface/datatrove`
I will try, thank you. So, you mean any Python version is OK?
3.10+ should be fine.
Hello @hynky1999,
I've installed datatrove like this: `pip install 'datatrove[all]'@git+https://github.com/huggingface/datatrove`. The script now runs without errors in the logs, but I think we are still not getting the expected result. I've uploaded the log files: https://drive.google.com/drive/folders/1JjbxAKdsfgAaFm3H9Y_JsD-8O6MwoOmf. If possible, could you give me your opinion on the logs again?
Thank you
Hi, could you try processing more samples? 10k+? (setting the `limit` variable in the reader, e.g. as sketched below)
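For example, in the reader (a sketch modeled on the WarcReader call in process_common_crawl_dump.py; the dump name is a placeholder):

```python
from datatrove.pipeline.readers import WarcReader

DUMP_TO_PROCESS = "CC-MAIN-2023-50"  # placeholder dump name

# limit caps how many documents each task reads; 10k+ gives a more
# representative sample than a handful of records.
reader = WarcReader(
    f"s3://commoncrawl/crawl-data/{DUMP_TO_PROCESS}/segments/",
    glob_pattern="*/warc/*",
    default_metadata={"dump": DUMP_TO_PROCESS},
    limit=10_000,
)
```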
Hello @hynky1999, how are you doing? I really hope to get in touch. Can we use Discord or Telegram? If you don't want to, could you let me know your preference? Thank you.
Hi, I don't want to resolve this issue anywhere outside of the GitHub issues. What's the state of your problem now? I can't see any logs in the Google Drive folder you sent. PS: Could you post logs directly to this issue conversation next time?
How are you @hynky1999? Could you check this status? Also, the first task took me 10 hours, and I have 6000 tasks in total. How can I increase the speed?
So now you can see the output? Re speed, there is not much you can do to speed things up other than using more CPUs / improving IO.
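In executor terms, the relevant knobs look roughly like this (a sketch assuming the SlurmPipelineExecutor setup from the examples; all values are illustrative placeholders, not recommendations):

```python
from datatrove.executor.slurm import SlurmPipelineExecutor

# With 6000 tasks, wall-clock time is governed by how many run at once:
# workers caps concurrency (-1 means no cap), and more CPUs or a less
# contended partition is what actually buys speed.
executor = SlurmPipelineExecutor(
    job_name="cc_processing",              # placeholder
    pipeline=[],                           # your existing pipeline steps go here
    tasks=6000,                            # total number of tasks, as in your run
    workers=200,                           # illustrative concurrency cap
    cpus_per_task=1,
    time="24:00:00",                       # placeholder
    partition="normal",                    # placeholder
    logging_dir="s3://your-bucket/logs/",  # placeholder
)
```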
Hello, Datatrove enthusiasts,
Nice to meet you all.
Recently, I've been working with the Datatrove library, trying to run the sample script process_common_crawl_dump.py from the Datatrove GitHub repository. I've made a couple of changes to the script: I reduced the number of tasks from 8000 to 4 and changed `randomize_start_duration` to `randomize_start`. However, after running the script, I encountered some issues (see the sketch of my edits below).
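Roughly, my edited executor call looked like this (a sketch with placeholder values everywhere except the two changes; `randomize_start`, which I assume is the older name of `randomize_start_duration`, is what my pip-installed version expected):

```python
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.readers import WarcReader

# Sketch of the two edits; placeholders everywhere else.
executor = SlurmPipelineExecutor(
    job_name="cc_test",  # placeholder
    pipeline=[
        WarcReader(
            "s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/",  # placeholder dump
            glob_pattern="*/warc/*",
        ),
        # ...the rest of the example script's pipeline steps...
    ],
    tasks=4,               # change 1: reduced from 8000
    randomize_start=True,  # change 2: older argument name (assumed boolean)
    time="10:00:00",       # placeholder
    partition="normal",    # placeholder
    logging_dir="s3://data-refine/base_processing/",  # placeholder
)
```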
Here is the accounting history that I received:
Additionally, I believe these logs are stored on my S3. I was expecting some output as a result, but there are no output directories or files; I only got log files.
For reference, here is my slurm.conf file. I've tried running the script multiple times, but I always get the same result. I'm not sure if this is the right place to ask for help, but I would appreciate any assistance from fellow Datatrove lovers.
Thank you!