kucci / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

Trie interface(s) and implementation(s) #10

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is a contribution of LimeWire's PatriciaTrie, as discussed at: 
http://groups.google.com/group/google-
guice/browse_frm/thread/ffb2a3b3b9e39e79?tvc=1 .

The files can be licensed as necessary (we own the copyright and can 
change/transfer the license).  I'm not sure what license, if any, these 
would need to be for inclusion.

Original issue reported on code.google.com by sberlin on 24 Sep 2007 at 6:12

Attachments:

GoogleCodeExporter commented 9 years ago
thanks!

It would be really, really (really) helpful if we could collect, here, a good 
variety
of use cases for this trie.

Original comment by kevin...@gmail.com on 1 Oct 2007 at 7:51

GoogleCodeExporter commented 9 years ago
We use it internally for a few cases.
 1) An IP list.  We have a KeyAnalyzer that analyzes ip addresses to store them 
(and/or ranges of them) as compactly as possible.  See: 
https://www.limewire.org/fisheye/browse/~raw,r=1.18/limecvs/core/com/limegroup/g
nutel
la/filters/IPList.java . 

 2) Storing a Kademlia DHT's internal structure.  This is probably a very specific 
use-case, although it works very well. :)

Other use cases:
 3) A very efficient (both in memory & CPU) dictionary.  It could be used for 
something like a cellphone's phonebook.  You start typing and it immediately 
finds 
all strings that began with your string.  At the same time the structure acts 
like a 
Map preventing you from adding the same String twice.  It's kind of like a 
super-
SortedMap.

I'm sure there's a ton of other cases, but those are the ones we use (and have 
plans 
to use).  The really awesome things about Patricia is the way it does lookups.  
Consider the IPList -- IPv4 addresses have a fixed size of 32 bits.  So no 
matter 
how many items you have stored in the Trie, it will always do at most 32 bit 
comparisons.  And it will often do less (because it intelligently traverses the 
tree), and it will keep things sorted and tell you all addresses that begin 
with 
X.Y, etc...  

Original comment by sberlin on 1 Oct 2007 at 8:10

GoogleCodeExporter commented 9 years ago
Great stuff, folks.  Thanks so much.

With the sheer mountain of work involved in getting our existing stuff 
polished, I'm
afraid it may be some time before I can look at breaking into new territory. 
But, I
do think that this trie and tries in general are a very promising addition, so 
don't
lose hope!

Original comment by kevin...@gmail.com on 16 Oct 2007 at 3:37

GoogleCodeExporter commented 9 years ago
Disclaimer: I don't know much about Tries.

I can imagine that many use cases for a Trie break into at least two categories

1. The SortedMap/NavigableMap the right API abstraction that the caller wants, 
but
due to the nature of the data, a Trie can be much faster than existing
implementations of these interfaces.

2. What the caller wants to see doesn't look like a Map at all, but something 
much
simpler that deals only with prefix-matching like (rough guess):

  public interface Trie<N, V> {
    boolean containsMatch(List<? extends N> sequence);
    V get(List<? extends N> sequence); // doesn't require exact match
    void put(List<? extends N> sequencePrefix, V value);
  }

A separate issue: should we consider providing an alternate interface that is
tailored to character-based tries, so this alternate API could use CharSequence 
in
place of List<Character>?  Then there would be two simple methods
Tries.asCharacterTrie() and Tries.forCharacterTrie() to go back and forth 
between the
two.

Original comment by kevin...@gmail.com on 23 Oct 2007 at 4:29

GoogleCodeExporter commented 9 years ago
From my experience, #1 is exactly right.  A Trie easily doubles as a 
super-efficient 
SortedMap/NavigableMap.

You hit the nail on #2 with the CharSequence version.  Trie's I've encountered 
function with a similar interface, but tailored towards the combined sequence 
instead of the pieces that build up the sequence.  That is, the methods would 
look 
like:

   public interface Trie<K, V> {
      boolean hasPrefix(K prefix);
      SortedMap<K, V> getPrefixedBy(K prefix); // returns a view over all entries 
prefixed by K
      void put(K key, V value);
   }

The difference is the Trie is directly acting off a K value, which would be the 
CharSequence to a Character, or a NumberSequence to a Number.  This has the 
advantage of being able to reuse the interface for arbitrary-length keys or 
fixed-
length keys whose contents don't divide easily into objects.  (For example, IP 
addresses being composed of bits, CharSequences, phone numbers, etc...)

The Trie interface in the submission includes a few additional convenience 
methods 
and directly subclasses SortedMap (if it were targetted for 1.6, it would 
extend 
NavigableMap too).  The additions are for better use with fixed-size keys and 
being 
able to visit the entries in the map while traversing (with a 'Cursor').

Original comment by sberlin on 23 Oct 2007 at 5:49

GoogleCodeExporter commented 9 years ago
(Really, that getPrefixedBy could just return a List<V> - having it return a vw 
of 
the map itself enables some really cool things.)

Original comment by sberlin on 23 Oct 2007 at 5:50

GoogleCodeExporter commented 9 years ago

Original comment by kevin...@gmail.com on 3 Nov 2007 at 5:32

GoogleCodeExporter commented 9 years ago

Original comment by kevin...@gmail.com on 2 Jun 2008 at 5:48

GoogleCodeExporter commented 9 years ago
Hello,

currently I'm wirting something for my studies and did a bit of research about 
Tries.
I fall upon a Hash Table performance like Trie called HAT-Trie.
It's a proposal for a fast memory aware Trie that can deal with non equal 
memory costs.
http://crpit.com/confpapers/CRPITV62Askitis.pdf

In my opinion it maybe would be a great advantage to use this Trie.

Hope I could help,
Tim

Original comment by tim.frey...@googlemail.com on 22 Mar 2009 at 10:31

GoogleCodeExporter commented 9 years ago
To completely generalize it, a Trie is kind of a particular way of representing 
a
  Set< List< E > >
or, easily enough, a
  Map< List< E >, V >
where E is typically a character (List<E> is a String). The representation 
allows
certain specific operations to be very efficient, and is efficient for storage 
when
there are many common prefixes among the List<E> (a spelling dictionary, for 
example).

There may be some benefit to completely abstracting it in this way, although
translating the common use case of a String key would be an excessive amount of 
overhead.

Original comment by ray.a.co...@gmail.com on 3 Aug 2009 at 5:53

GoogleCodeExporter commented 9 years ago
As a use case, I've used it for finding phrases within a large text document. I 
have
some set of phrases that I'm looking for (not Strings, but sequences of words, 
hence
- List<String>) stored as a Trie. To search, I walk the normalized text (split 
on
whitespace, punctuation removed, handle case insensitivity, etc.) which is also 
a
List<String>, keeping track of possible matches as I go.

Generalizing, this is finding all possible subsequences of List<E> (the text) 
that
are members of a given Set<List<E>> (the search phrases).

The big benefit is that the text, which can be very large, only needs to be 
traversed
once to find any one (or all) of thousands of possible phrases. It scales very 
well.

Original comment by ray.a.co...@gmail.com on 3 Aug 2009 at 6:05

GoogleCodeExporter commented 9 years ago
I have the same use case as Ray.a.conner, and I'd also use it for scalable 
tab|auto-
completion in interactive consoles|text fields.

Original comment by cresw...@gmail.com on 3 Aug 2009 at 6:31

GoogleCodeExporter commented 9 years ago
Are you folks using the attached PatriciaTrie implementation, or a separate one?

Original comment by sberlin on 3 Aug 2009 at 6:34

GoogleCodeExporter commented 9 years ago
The Google code base includes multiple Java trie implementations. If we chose 
to add
a trie to the library, we'd probably start with one of those.

Original comment by jared.l....@gmail.com on 3 Aug 2009 at 6:44

GoogleCodeExporter commented 9 years ago
I rolled my own implementation, some years ago. I wasn't storing simple 
strings, and
pretty much everything out there was character-centric. Once I made the 
abstraction
to sequences of strings, sequences of objects was a no-brainer. Although I never
needed to implement many of the Trie operations that others find useful; I only
needed sub-sequence matching.

Original comment by ray.a.co...@gmail.com on 3 Aug 2009 at 8:13

GoogleCodeExporter commented 9 years ago

Original comment by kevin...@gmail.com on 17 Sep 2009 at 6:02

GoogleCodeExporter commented 9 years ago

Original comment by master.j...@gmail.com on 24 Nov 2009 at 8:06

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 Jul 2010 at 3:54

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 Jul 2010 at 3:56

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 27 Jan 2011 at 1:59

GoogleCodeExporter commented 9 years ago
Work is beginning... expect in maybe release 10 or 11.

Original comment by kevinb@google.com on 3 Feb 2011 at 6:24

GoogleCodeExporter commented 9 years ago
Awesome! I don't know if it's of any use for you but our Trie has been under 
ASL 2.0 for a few years now and has undergone a few refacorings.

<http://code.google.com/p/patricia-trie>
<https://github.com/rkapsi/patricia-trie>
<https://github.com/rkapsi/simple-patricia-trie>

The Trie interfaces are maybe useful even if you're not going to use PATRICIA.

Cheers,
 Roger 

Original comment by rka...@gmail.com on 24 Feb 2011 at 10:35

GoogleCodeExporter commented 9 years ago
Yep, will certainly take this into consideration, thanks for the links. I find 
the last method (#prefixMap(prefix)) especially nifty, exposing the recursive 
structure of a trie (though I would expect to see the self type being returned, 
instead of SortedMap, that's lossy). 

That said, I have to admit that personally I'm a bit predisposed against 
patricia. My reasoning is that for the same kind of performance, we could go 
with a ternary trie, which is more flexible (e.g. no key analyzer, not 
necessarily unbalanced trees). But I don't trust what "that kind of 
performance" means for big tries. I'm still going through the literature 
hunting for other options (e.g. 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499, 
http://www.naskitis.com/)

Original comment by jim.andreou on 25 Feb 2011 at 2:35

GoogleCodeExporter commented 9 years ago
It probably could return the Trie itself, but would require a bit more work to 
make sure the additional trie operations stayed with the sub-trie range.  Roger 
& I ate, slept & breathed patricia for a couple weeks while putting this 
together years ago.. so don't hesitate to ask if you have any questions on it.

Original comment by sberlin on 25 Feb 2011 at 2:55

GoogleCodeExporter commented 9 years ago
I'd be interested in working on this -- heh, I've been living and breathing 
http://hackage.haskell.org/package/TrieMap for some time now.

Original comment by wasserman.louis on 7 Mar 2011 at 5:41

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 23 Mar 2011 at 1:48

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:19

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 3:11

GoogleCodeExporter commented 9 years ago
Any update on this?  I just checked out 11.0.2 and I still don't see a Trie 
implementation in it.

Original comment by ian.g.si...@gmail.com on 10 Mar 2012 at 4:48

GoogleCodeExporter commented 9 years ago
There's been some progress, but unfortunately it's not high-priority. :-(  
Unlikely for 13.0, maybe 14.0.

Original comment by kevinb@google.com on 11 Mar 2012 at 11:08

GoogleCodeExporter commented 9 years ago
So what would it take to make it a higher priority?  Getting more people to 
star the issue and/or rant and rave?

Original comment by ian.g.si...@gmail.com on 22 Mar 2012 at 6:23

GoogleCodeExporter commented 9 years ago
It's just not a data structure that has that many use cases or appears that 
often, I think?

Original comment by wasserman.louis on 22 Mar 2012 at 6:31

GoogleCodeExporter commented 9 years ago
Or to put it another way, the hardness of designing a trie API to Guava's 
standards is very high -- it's such a general structure! -- and it isn't very 
widely applicable.

I think starring the issue can't hurt, and might help; ranting and raving *can* 
hurt and definitely won't help; talking about why you need this, discussing 
your priorities for the data structure and outlining some detailed use cases 
helps more than anything else, but won't necessarily guarantee that it'll be a 
top priority.

Original comment by wasserman.louis on 22 Mar 2012 at 6:35

GoogleCodeExporter commented 9 years ago
I was trying to be cynical in regard to ranting and raving and was implying 
people making a strong use case for it actually :)

I just wanted to get a better idea into what provoked priority in Guava.

Original comment by ian.g.si...@gmail.com on 22 Mar 2012 at 8:24

GoogleCodeExporter commented 9 years ago
I feel that guava could benefit from Trie internal implementation. This can be 
very efficient for java devs looking to use Scala like data structures. 
http://www.meetup.com/saylambda/events/47056842/

Original comment by ma...@vekslers.org on 27 Mar 2012 at 4:10

GoogleCodeExporter commented 9 years ago
I'm interested in a persistent approach based on the COW-principle, that is 
revisions of the tree are stored in a very compact form and updates affect a 
minimum of nodes (which I think should be the normal case for tries). But most 
likely I will have to modify an existing implementation or write it from 
scratch. That is if Lucene can't be used, as I want to provide a full-text 
index for a versioned tree-structure.

Original comment by Lichtenb...@gmail.com on 28 Mar 2012 at 5:24

GoogleCodeExporter commented 9 years ago
I'd just like to the sentiment that tries are very useful and find uses all 
over the place. As a PhD student in CS, I work on natural language processing 
and machine translation. There, we use tries for efficiently storing language 
models (a bunch of probabilities associated with n-grams) and even translation 
models (a list of target phrases associated with some source language 
phrase/n-gram and, of course, more probabilities).

NLP people would love to see a trie in Guava. :)

Original comment by jon.h.clark on 28 Mar 2012 at 5:30

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 May 2012 at 7:43

GoogleCodeExporter commented 9 years ago
+1 for Trie, I find myself wanting a solid Java implementation on occasion, and 
usually fall back to a TreeMap<String> hack which gets the job done (albeit 
with a much larger memory footprint).

Original comment by Sam.Hall...@gmail.com on 1 Aug 2012 at 7:30

GoogleCodeExporter commented 9 years ago
I'd be very interested in a Trie implementation in guava. I use hadoop and need 
to filter the incoming data by a large set of strings, on the order of 2 
million strings. Right now I have a file containing these strings, one per 
line, that each mapper reads into a hashset to do the filtering. The speed 
performance is fine, it just takes a ton of memory. I've used the exact same 
approach with the exact same set of strings in c++ and it never took close to 
the amount of memory that java is using. The strings I'm filtering by have a 
good deal of common leading characters, I think our memory savings would be 
significant with a trie. I'd like to see a TrieMap and a TrieSet.

Original comment by onlyn...@gmail.com on 22 Aug 2012 at 2:28

GoogleCodeExporter commented 9 years ago
An interesting project for concurrent tries: 
http://code.google.com/p/concurrent-trees/

Original comment by phraktle on 15 Nov 2012 at 11:53

GoogleCodeExporter commented 9 years ago
I am presently finding Trie structures very useful in my code generation / 
dependency injection library.  Scanning classpaths is expensive, and java 
packages often share long, repetitive package names; couple that with the fact 
my codegen library descends not just through the package structure, but the 
class, method, field and annotation structure as well (into a single uber-trie 
of your module), and a fast, concurrent trie is absolutely vital.  

Compared to the org.reflections classpath scanner (which uses MultiMap and 
single threaded scanning), my Trie-backed implementation can, if scanning 
multiple jars with multiple threads, run over seven times faster, and deliver 
deep iterators, instead of multiple MultiMaps delivering set views.

+1 For Trie

Original comment by Ja...@wetheinter.net on 24 Jan 2013 at 1:32

GoogleCodeExporter commented 9 years ago
Sorry for the double post, but one more place I use Trie's for great effect: My 
StringTrie which uses char[] or CharSequence as keys, internally uses a 
CharPool that extends the base Trie structure, and translates a char[], start 
index and end index into singleton char[] instances (which pass == tests).  
This changed my memory footprint from double that of TreeMap to less than half 
(on larger datasets).  

My implementation does not store a node per character; rather, it will store a 
char[] matching a unique sequence of chars, and using a trie-backed CharPool, 
it will only create as many arrays as there are unique sequences in your data 
(and using a shared CharPool shared across instances ensures I never allocate a 
byte more memory than I need).

In my benchmarks, I found that memory usage on highly-variant, dense data 
structures (like long sequences of toStringd() random numbers) was quite high 
(a new char[] per node), but with a CharPool to the rescue, memory usage was 
based on number of unique sequences + number of unique sequence combinations, 
instead of unique sequence combinations * number of sequences per combination.

Original comment by Ja...@wetheinter.net on 24 Jan 2013 at 2:22

GoogleCodeExporter commented 9 years ago
Hi James,

Any chance that you'll share your implementation? I also have a use case where 
I'm currently using a TreeMap (which allows for a poor man's prefix matching) 
and I would like to experiment with a Trie to save some memory.

Original comment by knut.wan...@gmail.com on 24 Jan 2013 at 2:40

GoogleCodeExporter commented 9 years ago
I'll package it up, put it on github in the near future, and reply back here 
with a link.  Star the issue, and you'll get emailed.

The concurrency support is tied directly into my own internal library (which 
does let you inject your own implementations for handling threads or ignoring 
multithreading), but the whole library itself is Not Safe For Work (tm).

If I can't deploy a new version soon, I can put a cut out standalone of the 
trie and the charpool in a gist instead; just be warned that for small maps, 
you won't be saving space.  Large maps, especially ones with common prefixes 
(or subsequences, using a shared CharPool) WILL save you on memory.

For my purposes, I save lots, because I am working with long, repetitive 
strings (especially when descending into inner classes, methods, fields, vars, 
etc).  If you just maintain a big map of mostly-unique strings, trie is 
probably not for you. 

Original comment by Ja...@wetheinter.net on 24 Jan 2013 at 4:59

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi James,

Did you have time to put your trie implementation up on github? I'm interested 
because I have a large pool of keys sharing very long prefixes and I'm 
interested in your approach.

Original comment by oliver.s...@gmail.com on 15 Mar 2013 at 12:04

GoogleCodeExporter commented 9 years ago
Jetty util pkg has some Trie structures if it helps - 
http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-util/src
/main/java/org/eclipse/jetty/util/ArrayTernaryTrie.java

Original comment by ashwin.j...@gmail.com on 15 Mar 2013 at 5:01

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:16

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:19