sanyaade-speechtools / delphi-museum-project

Automatically exported from code.google.com/p/delphi-museum-project

Consider refining indexer to be smarter about facet interactions #202

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Perhaps the indexer should consider every facet for each n-gram, so that the
longest n-gram match wins across facets, rather than processing one facet at
a time over all n-grams in a field.
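
As a rough illustration of the "longest n-gram wins across facets" idea, here is a minimal Python sketch. It is not the Delphi indexer's code: the function name, the whitespace tokenizer, and the tiny vocabularies are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabularies for illustration only; the real ontology
# terms are not listed in this issue.
FACETS: Dict[str, List[str]] = {
    "location": ["San Francisco", "Owens Valley", "Big Pine"],
    "culture": ["San", "Shoshone", "Big Pine Shoshone",
                "Owens Valley Paiute Shoshone"],
}


def match_longest_first(text: str,
                        facets: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (facet, term) matches, trying the longest n-grams first
    across ALL facets, and never reusing a token that is already matched."""
    tokens = text.lower().split()  # naive tokenizer: whitespace only

    # One combined lookup: tuple of lowercased words -> [(facet, term), ...]
    lookup: Dict[Tuple[str, ...], List[Tuple[str, str]]] = {}
    for facet, terms in facets.items():
        for term in terms:
            key = tuple(term.lower().split())
            lookup.setdefault(key, []).append((facet, term))

    max_n = max((len(key) for key in lookup), default=0)
    used = [False] * len(tokens)
    matches: List[Tuple[str, str]] = []

    # Walk n-gram sizes from longest to shortest.
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            if any(used[i:i + n]):
                continue  # overlaps a longer match already taken
            hits = lookup.get(tuple(tokens[i:i + n]))
            if hits:
                matches.extend(hits)
                for j in range(i, i + n):
                    used[j] = True
    return matches
```

With these made-up vocabularies, "San Francisco" is claimed by the location facet before "San" can match as a culture, and "Big Pine Shoshone" is claimed by the culture facet before "Big Pine" can match as a location. A real implementation would also need to handle punctuation and decide what to do when two facets share an n-gram of the same length.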

From a mail thread:

The question is: what phrases would we use to recognize the cultures and
places in the ontology, and can we in general look first for places
(removing found phrases), and then look for cultures?

Thus, if we look for places first, we find "San Francisco", and remove it
from further consideration on that object, so that "San" is not matched as
a culture. 

In your example, if an object had the location "Owens Valley" we would find
it and remove it. So if the full culture name "The Big Pine Band of Owens
Valley Paiute Shoshone Indians of the Big Pine Reservation" was in there,
the location "Owens Valley" would make us NOT match the culture properly. 

The question is then whether objects are really marked in TMS with the full
culture name, or whether shorthands are used that would not cause the
problem above. However, I suspect that even a shorthand would look like
"Big Pine Shoshone", so if Big Pine is a location, we would not see the
proper culture and would instead match the much more generic "Shoshone".
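
To make the failure mode concrete, here is a small Python sketch of the "places first, then cultures, removing found phrases" ordering discussed above. The function name, the whitespace tokenizer, and the vocabularies are assumptions for illustration and are not taken from the indexer.

```python
from typing import List, Tuple


def match_facet_by_facet(
        text: str,
        ordered_facets: List[Tuple[str, List[str]]]) -> List[Tuple[str, str]]:
    """Process facets strictly in the given order. Tokens matched by an
    earlier facet are removed from consideration for later facets."""
    tokens = text.lower().split()  # naive tokenizer: whitespace only
    used = [False] * len(tokens)
    matches: List[Tuple[str, str]] = []

    for facet, terms in ordered_facets:
        # Within one facet, still prefer longer terms.
        for term in sorted(terms, key=lambda t: -len(t.split())):
            words = term.lower().split()
            n = len(words)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == words and not any(used[i:i + n]):
                    matches.append((facet, term))
                    for j in range(i, i + n):
                        used[j] = True
    return matches


# Hypothetical vocabularies, locations ordered before cultures.
PLACES_FIRST = [
    ("location", ["San Francisco", "Big Pine"]),
    ("culture", ["San", "Shoshone", "Big Pine Shoshone"]),
]

print(match_facet_by_facet("San Francisco", PLACES_FIRST))
print(match_facet_by_facet("Big Pine Shoshone", PLACES_FIRST))
```

Under these assumptions the first call matches only the location "San Francisco", which is the desired result. The second call matches the location "Big Pine" and then only the generic culture "Shoshone", because the tokens needed for "Big Pine Shoshone" have already been consumed.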

Maybe we need a slightly smarter algorithm in the indexer...

Patrick

> -----Original Message-----
> From: Natasha Johnson [mailto:johnsonnl@berkeley.edu]
> Sent: Wednesday, January 27, 2010 10:39 AM
> To: 'Michael T. Black'; 'Patrick Schmitz'
> Subject: RE: Feedback on Object: 14-1582
> 
> Often in California the proper political names for tribes combine 
> culture and then place, such as "The Tachi Yokut tribe of the Santa 
> Rosa Rancheria".
> That is a name that is somewhat separate from the culture (as Tachi 
> Yokuts can be registered at other Rancherias, and they existed before 
> the Rancheria was formed, and lived many other places than where the 
> Rancheria is now). Or another example is "The Big Pine Band of Owens 
> Valley Paiute Shoshone Indians of the Big Pine Reservation".  The 
> cultural name in that case would be the Owens Valley Paiute Shoshone 
> Indians, as Paiutes and Shoshones can be found from Bakersfield to 
> Idaho.
> 
> Hope this helps, I'm not sure what the exact question is.
> 
> -----Original Message-----
> From: Michael T. Black [mailto:mtblack@berkeley.edu]
> Sent: Tuesday, January 26, 2010 7:36 PM
> To: Patrick Schmitz
> Cc: Natasha Johnson
> Subject: Re: Feedback on Object: 14-1582
> 
> Hi there,
> 
> I'll run it by Tasha, but off the top of my head, I can think of the
> Pueblo tribes (e.g., San Ildefonso Pueblo), where the tribe is formed 
> by adding "Pueblo" to the place name of the Pueblo.
>  But in these cases, figuring out which comes first is moot, as 
> everything from that place (at least in our collection) is associated 
> with that culture (and in many cases vice versa, as these were 
> essentially city-states).
> 
> Michael
> On Jan 26, 2010, at 5:36 PM, Patrick Schmitz wrote:
> 
> > BTW, rather than adding a ton of exclusions, we should order 
> > location before culture, and then the logic will preclude matching 
> > San.
> >
> > Can you think of cases where the culture name is longer than the 
> > location (e.g., "San Francisco foobar tribe"), and we'd want to do
> > the reverse?
> >
> > Patrick
> >
> >> -----Original Message-----
> >> From: Michael T. Black [mailto:mtblack@berkeley.edu]
> >> Sent: Monday, January 25, 2010 5:17 PM
> >> To: Patrick Schmitz
> >> Cc: Delphi Feedback
> >> Subject: Re: Feedback on Object: 14-1582
> >>
> >> I'm not sure, but I think you're right...
> >>
> >> Will fix.
> >>
> >> On Jan 25, 2010, at 5:06 PM, Patrick Schmitz wrote:
> >>
> >>> Stuff from San Francisco is not all from the San culture.
> >>
> >>
> >
> 
> 

Original issue reported on code.google.com by LudicrousResearcher@gmail.com on 27 Jan 2010 at 7:40