uqbingliu / DivEA


Question about the running time #2

Closed nxchenbnu closed 1 year ago

nxchenbnu commented 1 year ago

Thank you for your work again! I have another question about the running time reported in your paper.

I noticed that in Table 5 (time cost of the EA division), DivEA achieves a remarkably low running time of only 398 seconds on the 100K datasets. Does the time here represent the total training time? If not, can you explain what it includes?

Thank you again!

uqbingliu commented 1 year ago

In Tables 4 and 5, we aim to show the extra time consumption apart from running an EA model. The "EA" column represents running the EA model, while the other columns represent task division by each method.

In the DivEA framework, the time cost of task division consists of three parts: (1) dividing unmatched source entities, t1; (2) counterpart discovery for subtask i, t2_i; (3) building the context graph of subtask i, t3_i.

The DivEA column represents using DivEA in a single-machine setting, where the time of task division is t1 + \sum_i (t2_i + t3_i). The DivEA (par.) column represents using DivEA in a parallel setting, where the time of task division is t1 + \max_i (t2_i + t3_i).
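The difference between the two totals can be sketched as a small calculation (all timings below are made-up illustrations, not numbers from the paper):

```python
# Illustrative sketch of the two task-division time formulas above.
# t1: time to divide unmatched source entities.
# t2[i], t3[i]: counterpart discovery and context-graph building for subtask i.
t1 = 10.0                      # seconds (hypothetical)
t2 = [5.0, 7.0, 6.0]           # per-subtask counterpart discovery
t3 = [3.0, 4.0, 2.0]           # per-subtask context-graph building

# Single-machine (DivEA column): subtasks run one after another.
time_sequential = t1 + sum(a + b for a, b in zip(t2, t3))

# Parallel (DivEA (par.) column): subtasks run concurrently,
# so only the slowest subtask contributes.
time_parallel = t1 + max(a + b for a, b in zip(t2, t3))

print(time_sequential)  # 37.0
print(time_parallel)    # 21.0
```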

nxchenbnu commented 1 year ago

I have more questions about running your code.

I use 2 RTX 3090 GPUs and run on the DBP-1M datasets. The context building costs around 60 hours, which seems inconsistent with your Table 4, where DivEA (par.) costs only 428s. Maybe I misunderstood something in your paper? Looking forward to your reply.

uqbingliu commented 1 year ago

Hi @nxchenbnu , thanks for reporting this problem to me.

Can I ask what the DBP-1M datasets are? I did not use them in our experiments, so the 428s is not for them (it is for FB-DBP-2M). Can you give me a link to this dataset?

Another thing I would like to know is which script file you ran; the configurations also matter.

If you are using a new dataset, I will help you figure out where the problem is. I believe it should not take 60 hours, since FB-DBP-2M takes <15h even in a centralized setting.

nxchenbnu commented 1 year ago

Thanks so much! The DBP1M dataset is from LargeEA (https://github.com/ZJU-DAILY/LargeEA) and the download link is https://drive.google.com/file/d/15jeGD-6pVGlqI5jCn7KJfGIER6AeoQ-L/view.

I think subtask_num matters, and I directly changed it to half of what [run_over_perf_vs_sbp_2m.sh] uses. My settings are as follows:

```shell
--data_dir="../datasets/1m/en_fr" --kgids="en,fr" --output_dir="../output/1m/en_fr" \
--subtask_num=100 --subtask_size=39874 --ctx_g1_percent=0.4 --ctx_g2_conn_percent=0.0 \
--ctx_builder=v2 --ea_model="rrea" --alpha=0.9 --beta=1.0 --gamma=2.0 --topK=10 \
--max_iteration=5 --gpu_ids=0,1 --seed=1011
```

uqbingliu commented 1 year ago

Thanks for sharing your running settings. I am checking the code to work out the reason and will update you later.

uqbingliu commented 1 year ago

Hi @nxchenbnu, can you try setting --ctx_builder=v1, as in run_over_perf_vs_sbp_2m.sh?

nxchenbnu commented 1 year ago

Thanks so much !

I have tried it and it worked on the 100K datasets, although I still can't match the running time in your paper. I guess that's because I only use 2 RTX 3090 GPUs while you used 3 RTX 2080s? I will try it on the 1M datasets later.

I have another 2 questions: (1) Should I change subtask_num to 100? (2) The EA time also seems too long, and I am trying to fix that.

Thanks again for your kind help!

uqbingliu commented 1 year ago

I think you want to run the code on your 1M dataset rather than the 100K datasets, right? If you run the code on the 100K datasets, you can use a script of mine like run_over_perf_vs_cps.sh.

In this work, we basically want to run existing EA models on large-scale KGs without raising out-of-memory exceptions. The value you set for subtask_size is directly limited by the size of the GPU memory. Any value of subtask_size that does not cause out-of-memory is acceptable, but larger values are better since they make full use of the GPUs.

As for subtask_num, a basic setting is total node num / subtask_size. You can set a larger number, which means each subtask contains fewer unmatched entities and thus more context entities. This may improve EA effectiveness since the context size becomes larger, but it will also take more total running time.
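The baseline setting described above can be sketched as follows (the function name and the ~1M node count are illustrative assumptions, not values from the repo):

```python
import math

def basic_subtask_num(total_node_num: int, subtask_size: int) -> int:
    """Baseline subtask_num: just enough subtasks to cover every node.
    Setting a larger value leaves room for more context entities per
    subtask (possibly better EA quality), at the cost of more total
    running time."""
    return math.ceil(total_node_num / subtask_size)

# Assuming roughly 1M nodes and the subtask_size=39874 from the script above:
print(basic_subtask_num(1_000_000, 39_874))  # 26
```

Against this baseline of 26, the subtask_num=100 used above is on the "larger" side: fewer unmatched entities per subtask, more context, more total time.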

nxchenbnu commented 1 year ago

Thanks for your help !

After I set --ctx_builder=v1, the speed of DivEA improved. Could you give some explanation of this?

I also find that the EA time is too long: training one subtask costs about 3 minutes. With 5 iterations and 100 subtasks, the overall training time may add up to 5 × 100 × 3 = 1500 minutes ≈ 25h. Could I shorten the number of iterations from 5 to 3?

uqbingliu commented 1 year ago

Regarding the setting of ctx_builder: when the KGs are of large scale, our evidence-passing mechanism becomes slow. So, in the v1 implementation, we first extract a subgraph for each subtask and then select the context entities from that subgraph.

Regarding training time: the number of iterations should be set according to the convergence of the EA model. Smaller values like 3 may hurt performance a bit. If you really want to save running time, you can consider reducing the number of training epochs within each iteration, especially in the later iterations.
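The epoch-reduction idea can be sketched as a simple decaying schedule (a minimal sketch; the function name, base epoch count, and decay factor are made-up illustrations, not values from the paper or the repo):

```python
def epoch_schedule(base_epochs: int, num_iterations: int, decay: float = 0.5):
    """Train fewer epochs in later iterations: the model is already close
    to convergence by then, so later iterations need less refinement."""
    return [max(1, round(base_epochs * decay ** it)) for it in range(num_iterations)]

# 5 iterations starting from 100 epochs, halving each time:
print(epoch_schedule(100, 5))  # [100, 50, 25, 12, 6]
```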

nxchenbnu commented 1 year ago

Thank you very much, I have successfully run your code on the 1M dataset.

Some small details I edited before I finally ran the code:

  1. components.py, _select_g2_candidates: we calculate the distance weight of neighbours in _get_dist_weight. When a subtask has no anchor in KG2, errors occur in _partition_g2_entities. I guess this is because, when subtask_num is large, some seed alignments accidentally do not fall into any subgraph.

  2. Similar problems happen at test time: when test seed alignments from KG1 cannot find their counterparts in KG2, part_test_align in _build_g2_subgraph will be None, and the eval_alignment_tensor in compute_metrics (CLCS_torch) will be tensor([]).
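One hedged way to guard against both failure modes is to filter out subtasks with an empty anchor or test-alignment set before building the KG2 subgraph. The structure below is entirely hypothetical (a dict per subtask with 'anchors' and 'test_align' keys), a stand-in for the actual DivEA data structures rather than the real code:

```python
def safe_subtasks(subtasks):
    """Yield only subtasks that have at least one KG2 anchor and at least
    one test alignment, skipping the empty ones that would otherwise
    cause None / empty-tensor errors downstream.
    (Hypothetical structure: each subtask is a dict with 'anchors'
    and 'test_align' lists.)"""
    for task in subtasks:
        if not task.get("anchors"):
            continue  # no seed alignment fell into this subgraph
        if not task.get("test_align"):
            continue  # no test counterpart found in KG2
        yield task

tasks = [
    {"anchors": [1, 2], "test_align": [(1, 10)]},
    {"anchors": [],     "test_align": [(2, 20)]},   # skipped: no anchors
    {"anchors": [3],    "test_align": []},          # skipped: no test pairs
]
print(len(list(safe_subtasks(tasks))))  # 1
```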

If my understanding is correct, perhaps you could borrow some ideas from CPS and change the weights of already-aligned entities when dividing KG2. Hope all this helps.

Lastly, I earlier made a mistake about the source of my dataset. It is from LargeGNN rather than LargeEA, and it is specifically constructed for large-scale EA tasks. Here are the code link (https://github.com/JadeXIN/LargeGNN) and the dataset link (https://figshare.com/articles/dataset/DBpedia1M/21119380).

Thanks a lot for your kind help and well-written code!

uqbingliu commented 1 year ago

@nxchenbnu , thanks very much for sharing these details with me. Good luck with your work.