PH-tree - Project at National scientific computing laboratory - Brazil

andredemori commented 5 years ago

Hello Mr. Zäschke, I am a student of Information Technology and Communication in Brazil and currently work on a project titled Metric Indexing in Space Junctions at LNCC with Doctor Fábio Porto. In my research I worked on comparisons between data indexing methods like PH-tree, Slim-tree, Quatree. In these comparisons I observed the tree creation time and the search times to the neighbors in Range Queries. In short, I could see that PH-tree does not return all neighbors in their queries. I performed precision and recall calculations, and Slim-tree did better. Can you tell if the fact that PH-tree does not return all the neighbors is something normal of PH-tree's structure? It has something to do with the 'worst cases' mentioned in your article in '3.4 Spacial efficiency' or '3.5 Query efficiency'.

Thank you in advance.

André Muniz Demori - LNCC

tzaeschke commented 5 years ago

Hi, thanks for using the PH-Tree.

The PH-Tree should always work 100% correct, that means it should always return all matching entries. But I admit that rangeQuery() is a feature that I have hardly used/tested.

One thing to keep in mind is that the PH-Tree can store every point only once. If the same coordinate is added multiple times, the previous value is overwritten. Queries will only return the latest value for any coordinate.

I'm not entirely sure what you are exactly doing, you mention 'neighbors' and 'range queries', but the PH-Tree has no function to return neighbors in range queries. So I'm not really sure what kind of queries you performing:

range queries?
nearest neighbor queries?
window queries?

Some questions:

Are you using a distance function? Which one?
Which PhTree are you using? PhTree, PhTreeF, PhTreeSolid or PhTreeSolidF?
What kind of dataset (number of entries, dimensionality, ....)?
Can you give an example of some code where a point/rectangle is not returned?

If you are using rangeQuery(), maybe consider using window queries (query(...)) and filter the results afterwards, that may be faster.

Also, could you point me to some information about the SlimTree?

tzaeschke commented 5 years ago

Btw, if you want to store multiple values for a single coordinate you can store a HashSet or List of values for a coordinate.

andredemori commented 5 years ago

In my project, the ultimate goal is to perform a search on astronomical datasets to find geometric patterns brought about by gravitational lenses like the Eistein Cross. In this phase of the project I am comparing these structures in relation to the search for neighbors given a point and a distance (Range Queries ()). The dataset I'm using has 1,035,204 objects with their spatial coordinates and other attributes. Slim-tree is a Metric Access Method created in Brazil at the University of São Paulo - USP. I'm sending you the link to the article below:

http://conteudo.icmc.usp.br/pessoas/caetano/artigos/Traina2000_EDBT_SlimTree.pdf

This is a method that can optimize the issue of overlaps in nodes. I have the algorithm written in C ++ and in my tests it returned me an F-Measure of 1.0, ie 100% of the data that should be returned were returned.

In the case the PH-tree code returned an F-measure of approximately 0.97, and did not return all the neighboring points of the query objects.

andredemori commented 5 years ago

resultado

andredemori commented 5 years ago

As you can see, using a dataset with 98 objects, the so-called 'positivo' is the number of objects returned and that should actually be returned.

tzaeschke commented 5 years ago

Thanks for the reference, I did not know about the SlimTree.

I don't quite understand the meaning of the second line: "negativo | 12 | vizinhos nao retornados que nao deveriam ser retornados". I assume there should be almost 1,000,000 points that are (and should) not be returned? But it's not really important, just out of interest.

So I assume you are using 3D floating point coordinates via PhTreeF?
Could you tell me which version of the PhTree you are using (I assume 2.0.2)?
I also assume you are using the default distance function by simply calling rangeQuery(dist, center), correct?

As I mentioned, the PH-Tree ignores insertion with identical coordinates. So maybe you could check the following:

PhTree.size() return the number of entries. Can you confirm that it returns 1,035,204? If not, then this indicates that there may be duplicate coordinates during insertion.
When you look at the result from SlimTree, are they all distinct? No duplicate coordinates?

If you can't find any duplicates, would it be possible to send me the dataset so I can debug the problem?

tzaeschke commented 5 years ago

Ah, I understand now, the results are from a smaller dataset. Could you answer my questions and, if you don't find any duplicates, send me this small dataset?

andredemori commented 5 years ago

Thank you very much, you are absolutely right. I checked the dataset and there are duplications. As you mentioned, PH-Tree ignores insertion with identical coordinates.

Answers:

1 - Yes 2 - Yes 3 - Yes

tzaeschke commented 5 years ago

Thanks. Well technically, it's doesn't ignore the duplicates, it replaces the previous value. To fix this you could use PhTreeF<List<YourData>> or maybe simply PhTreeF<Integer> (where Integer is initially '1' but is incremented when the value already exists). I will soon (1-2 weeks) release an update that provides Java 8 Map API functions, such as putIfAbsent() or compute(). Especially the latter may be useful to handle such an Integer-counter or a List/Map/Set of values.

andredemori commented 5 years ago

Thanks.

tzaeschke / phtree

PH-tree - Project at National scientific computing laboratory - Brazil #7