typilus / typilus

Code for "Typilus: Neural Type Hints" PLDI 2020
MIT License

Reproducing PLDI results #1

Closed: urialon closed this issue 4 years ago

urialon commented 4 years ago

Hi @mallamanis, how are you? Thanks for sharing the code!

I'm trying to reproduce the PLDI dataset and results. I am preparing the data according to https://github.com/typilus/typilus/blob/master/src/data_preparation/README.md. I ran bash scripts/prepare_data.sh after modifying it to use the line

bash /usr/src/datasetbuilder/scripts/clone_from_spec.sh /usr/src/datasetbuilder/pldi2020-dataset.spec

instead of the two loops that start with: while IFS= read -r line and for repo in ./*; do.
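
Concretely, the relevant part of my modified script looks roughly like this (a sketch only; the original loop bodies are elided and the rest of the script is unchanged):

```bash
# Sketch of my local edit to scripts/prepare_data.sh (not the original script).
# The two cloning loops, roughly
#   while IFS= read -r line; do ... done
#   for repo in ./*; do ... done
# are replaced by a single call that clones everything listed in the PLDI 2020 spec:
bash /usr/src/datasetbuilder/scripts/clone_from_spec.sh \
    /usr/src/datasetbuilder/pldi2020-dataset.spec
```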

This has been running for 5 days already, and pytype-single is the process that is currently running. The last three lines printed to the console are:

[119/411] infer google.auth.jwt
[120/411] infer google.oauth2._client
[121/411] infer google.oauth2.service_account

Is this expected? Does that mean that about 30% of the work is done and it will take approximately 10-12 more days? This is running on an Ubuntu machine with many cores and plenty of RAM (top shows that only a single core is utilized, and 37% of the RAM).

Thanks!

urialon commented 4 years ago

It actually seems that this loop, https://github.com/typilus/typilus/blob/master/src/data_preparation/scripts/prepare_data.sh#L58, is completely parallelizable, right?
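
For example, something along these lines might spread the per-repository work across cores (just a sketch; process_repo is a hypothetical wrapper around the existing loop body, and I haven't tried this):

```bash
# Hypothetical sketch of parallelizing the per-repository loop.
# process_repo is a made-up function standing in for the existing loop body.
process_repo() {
    repo="$1"
    # ... the existing per-repository extraction steps would go here ...
    echo "finished: $repo"
}
export -f process_repo

# Run one process per repository directory, using all available cores.
ls -d ./*/ | xargs -P "$(nproc)" -I {} bash -c 'process_repo "$1"' _ {}
```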

mallamanis commented 4 years ago

Hi @urialon Thanks for looking into this!

The situation is a bit complicated here. The output you're seeing is from pytype, which does type inference for data augmentation. The 121/411 means that it has inferred types for 121 of the 411 dependent packages of a single project.

The good news is that pytype caches the results for the dependent packages, which means that they are reused for other projects. The bad news is that, as far as I can tell, pytype cannot run in parallel over many projects (the parallelization you suggest), as that would introduce conflicts in the cache (...). Also, pytype itself is single-threaded...

Others have also complained to me that the extraction takes too long. When I extracted the data it took about 10 days, but this doesn't seem to match the experience others have had, which is puzzling... (could it be that the Azure Fsv2 machines are so much faster?)

I think that this is an important issue (given that we cannot practically share the extracted data), so I see three options:

This won't replicate the exact corpus we used, of course. On the other hand, we never measured the impact that pytype has, which might be small.

A higher-level comment here is that the data situation in our field is problematic, due to legal constraints. It'd be nice to figure out a reasonable way to fix this...

urialon commented 4 years ago

Thanks for the quick response! I'll wait another week or so, and if it doesn't finish I'll see what happens without pytype at all.
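
What I have in mind is roughly the following; the SKIP_PYTYPE variable and the if-guard are purely illustrative and not something that exists in prepare_data.sh:

```bash
# Hypothetical sketch: make the pytype-based augmentation step optional.
# SKIP_PYTYPE is an environment variable invented for this illustration;
# it is not part of the existing prepare_data.sh.
if [ "${SKIP_PYTYPE:-0}" = "1" ]; then
    echo "Skipping pytype type inference / augmentation"
else
    # ... the existing pytype / pytype-single inference steps would run here ...
    echo "Running pytype type inference / augmentation"
fi
```

With a guard like that, the full preparation could then be run as SKIP_PYTYPE=1 bash scripts/prepare_data.sh.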

mallamanis commented 4 years ago

Let me know how this goes. It's important to make sure that the data extraction can be replicated.

urialon commented 4 years ago

still running...

mallamanis commented 4 years ago

That's very confusing :(

I suggest that we either remove pytype and rerun all the models on the new dataset, or try a newer version of pytype (here). The first option seems the most reasonable, since it will speed up the extraction once and for all... What do you think?

urialon commented 4 years ago

Yes, I'll just run it without pytype, thanks!