mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets
http://datacomp.ai/

Problems in run train.py #85

Open JianchengZ opened 3 weeks ago

JianchengZ commented 3 weeks ago

Hello, I ran into a problem. When I run [torchrun --nproc_per_node 4 train.py --scale small --data_dir ./Data --output_dir ./Results/ --exp_name clip_score_train_results], I get this error: [from training.distributed import world_info_from_env ModuleNotFoundError: No module named 'training']. I tried installing the module with both pip and conda, but it is still not found.

JianchengZ commented 3 weeks ago

Sorry, I have solved the problem above, but another problem still exists: [W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).

Could you please help me with it?
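
For context on the message above: [::] is the IPv6 wildcard address, 29500 is the default rendezvous port used by torchrun, and errno 97 on Linux is EAFNOSUPPORT. This suggests, though it does not prove, that IPv6 sockets are unavailable on the machine. A minimal sketch for checking that from Python is shown here. The helper name `can_bind` is hypothetical and not part of PyTorch or DataComp, and it only probes the local host; it says nothing about what c10d does afterwards.

```python
import socket


def can_bind(family: int, host: str) -> bool:
    """Return True if a TCP socket of the given address family can be
    created and bound to `host` on an ephemeral port (port 0)."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))
        return True
    except OSError:
        # Covers EAFNOSUPPORT (errno 97 on Linux) and other bind failures.
        return False


if __name__ == "__main__":
    # "::" is the IPv6 wildcard address that appears in the warning.
    print("IPv6 bind works:", can_bind(socket.AF_INET6, "::"))
    # "0.0.0.0" is the IPv4 wildcard address.
    print("IPv4 bind works:", can_bind(socket.AF_INET, "0.0.0.0"))
```

If the IPv6 line prints False while the IPv4 line prints True, the host most likely has IPv6 disabled. Whether training then proceeds over IPv4 is something to confirm from the rest of the log rather than assume.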

HariSeldon11988 commented 2 weeks ago

@JianchengZ

I have the same problem as you (from training.distributed ..... No module named 'training'). Can you tell me how you solved it? That would help me a lot :)

JianchengZ commented 2 weeks ago

@HariSeldon11988

Yes, here is the answer (given by another commenter): the training module comes from open_clip, and you can find it in the open_clip repository if you are interested in looking at the source code: https://github.com/mlfoundations/open_clip/tree/v2.16.1/src/training.
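
Since the module lives under src/training in the open_clip repository, one way to check whether Python can see it is sketched below. This is only an illustration under assumptions: the helper names `module_available` and `add_to_import_path` are hypothetical, and the path "./open_clip/src" assumes a local clone of open_clip next to train.py, which the thread does not confirm.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


def add_to_import_path(directory: str) -> bool:
    """Prepend `directory` to sys.path if it exists.

    Returns True if the directory exists (and is now importable from),
    False otherwise.
    """
    path = Path(directory).resolve()
    if not path.is_dir():
        return False
    if str(path) not in sys.path:
        sys.path.insert(0, str(path))
        # Make sure the import system notices the new directory.
        importlib.invalidate_caches()
    return True


if __name__ == "__main__":
    if not module_available("training"):
        # Assumed location of a local open_clip clone; adjust as needed.
        add_to_import_path("./open_clip/src")
    print("training importable:", module_available("training"))
```

Setting the PYTHONPATH environment variable to the same src directory before calling torchrun should have an equivalent effect without modifying any script.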