Closed: urialon closed this issue 4 years ago.
It actually seems that this loop: https://github.com/typilus/typilus/blob/master/src/data_preparation/scripts/prepare_data.sh#L58 is completely parallelizable, right?
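For concreteness, the kind of parallelization suggested here could look like the following sketch. This is hypothetical: `process_repo` is a stand-in for the real per-repo body of the loop, the demo directories are fabricated, and `xargs -P` simply fans the jobs out across cores:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of parallelizing a per-repo loop with xargs -P.
# `process_repo` is a placeholder for the real extraction steps.
set -euo pipefail

workdir="$(mktemp -d)"
mkdir -p "$workdir/repo_a" "$workdir/repo_b" "$workdir/repo_c"

process_repo() {
  # Placeholder: the real loop body would run the extraction here.
  touch "$1/done.marker"
}
export -f process_repo

# Run one job per repo directory, up to $(nproc) at a time.
find "$workdir" -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -P "$(nproc)" -I {} bash -c 'process_repo "$@"' _ {}
```

Whether this is safe for the actual script depends on whether the per-repo steps share any state (see the caching discussion below in this thread).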
Hi @urialon Thanks for looking into this!
The situation is a bit complicated here. The output you're seeing is from `pytype`, which does type inference for data augmentation. The `121/411` means that it has inferred types for 121 of the 411 dependent packages of a single project.
The good news is that pytype caches the results for the dependent packages, which means that they are reused for other projects. The bad news is that it seems to me that pytype cannot run in parallel for many projects (the parallelization you suggest) as it will introduce conflicts in the cache (...). Also, pytype is single-threaded...
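If one did want to parallelize despite the cache conflicts, an (untested) workaround would be to give each repo its own `-o` output directory, so concurrent runs cannot clobber a shared cache; the trade-off is losing the cross-project reuse described above. A sketch, where the `run_pytype` helper, the per-repo `./pytype-cache/<name>` layout, and the dry-run `echo` are my assumptions, while the flags are the ones used in `prepare_data.sh`:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one pytype cache directory per repo, so parallel
# runs don't conflict in a shared cache (at the cost of cross-project reuse).
# This only echoes the commands (a dry run); remove `echo` to execute them.
set -euo pipefail

run_pytype() {
  repo="$1"
  name="$(basename "$repo")"
  echo pytype -V3.6 --keep-going -o "./pytype-cache/$name" -P "$repo" infer "$repo"
}

# Fabricated repo names, for illustration only.
for repo in ./demo_repo_a ./demo_repo_b; do
  run_pytype "$repo" &
done
wait
```

Note that because each repo would re-infer its own copies of shared dependencies, this could end up doing more total work even if the wall-clock time improves.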
Others have also complained to me that the extraction takes too long. When I extracted the data it took about 10 days, but this doesn't seem to match the experience that others had, which is puzzling... (could it be that the Azure Fsv2 machines are so much faster?)
I think that this is an important issue (given that we cannot practically share the extracted data), so I see three options:

* Modify the current invocation (`pytype -V3.6 --keep-going -o ./pytype -P $repo infer $repo`) so that projects can be processed in parallel.
* Try a newer version of `pytype`.
* Not use `pytype` at all (delete L57-L72). This won't replicate the exact corpus we used, of course. On the other hand, we never measured the impact that `pytype` has, which might be small.
A higher-level comment here is that the data situation in our field is problematic, due to legal constraints. It'd be nice to figure out a reasonable way to fix this...
Thanks for the quick response!
I'll wait another week or so, and if it doesn't finish I'll see what happens without `pytype` at all.
Let me know how this goes. It's important to make sure that the data extraction can be replicated.
still running...
That's very confusing :(
I suggest that we either remove `pytype` and rerun all the models on the new dataset, or we try a newer version of `pytype` (here). The first option seems the most reasonable, since it will speed up the extraction once and for all... What do you think?
Yes, I'll just run it without pytype, thanks!
Hi @mallamanis, how are you? Thanks for sharing the code!

I'm trying to reproduce the PLDI dataset + results. I am preparing the data according to https://github.com/typilus/typilus/blob/master/src/data_preparation/README.md. I ran `bash scripts/prepare_data.sh` and modified it to have the line … instead of the two loops that start with `while IFS= read -r line` and `for repo in ./*; do`.

This has been running for 5 days already, and `pytype-single` is the process that is currently running. The last three lines that were printed to the console are: …

Is this expected? Does that mean that about 30% of the work is done, and it will take approximately 10-12 more days? This is running on an Ubuntu machine with many cores and enough RAM (`top` shows that only a single core is utilized, and 37% of the RAM).

Thanks!