silk-framework / silk

Silk Linked Data Integration Framework
http://silkframework.org/

Blocking seems not to be working #60

Open skarampatakis opened 8 years ago

skarampatakis commented 8 years ago

Hi, we have been using Silk Single Machine to create links between two datasets. We would like to enable blocking to reduce running times, but nothing seems to happen.

Although it is stated that blocking should be enabled by adding [<Blocking blocks="100" />], Java throws an error about a malformed configuration.

We changed it to <Blocking blocks="100" />

Silk seems to be running, but there is no reduction in running times. Is blocking actually being used but having no effect because of our data? Or is the configuration wrong?

afeliachi commented 8 years ago

Hi Sotirios, what distance measure(s) are you using? Have you checked that the attribute indexing="true" is set in your script?

skarampatakis commented 8 years ago

Where should this attribute be? We are using a mix of distance measures (levenshtein, jaro, dice, jaccard, etc.) with thresholds of 0.2.

afeliachi commented 8 years ago

You will find it in every Compare element, for example: <Compare id="comparison1" required="false" weight="1" metric="levenshteinDistance" threshold="1.0" indexing="true">. You have to set it to "true" for every distance measure you want to use for blocking. Blocking is based on indexing the values used in the comparisons (a block will contain similar values only).

As for the thresholds, they depend on the data and on how strict you want your interlinking to be, but I advise you to choose them carefully: 0.2 would work for normalized distance measures only. See the plugins documentation for more details on each distance measure.
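To make the point about thresholds concrete, here is a minimal Python sketch comparing a raw edit distance with a normalized one. It is only an illustration: the normalization used here (dividing by the length of the longer string) is an assumption, and the function names are hypothetical, so check the plugins documentation for how each Silk measure is actually defined.

```python
def levenshtein(a: str, b: str) -> int:
    """Raw edit distance: insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


threshold = 0.2
raw = levenshtein("agriculture", "agricultural")
norm = normalized_levenshtein("agriculture", "agricultural")
print(raw, raw <= threshold)    # a raw distance is a whole number, so only identical strings pass 0.2
print(norm, norm <= threshold)  # a normalized distance is a fraction, so near matches can pass 0.2
```

With a raw distance, a threshold of 0.2 behaves like an exact-match test, which is why the choice of threshold has to follow the scale of the measure.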

skarampatakis commented 8 years ago

We included indexing="true" in every metric, with no significant result.

afeliachi commented 8 years ago

After taking a look at the code, I think blocking is activated by default: even if you don't add <Blocking blocks="100" />, blocking was working from the beginning. The same applies to indexing. To be sure, can you take a look in {user_home}/.silk/entityCache/{your_interlink_id}/? You will find two folders, "source" and "target", and each one must contain 100 folders representing the blocks. If that is the case, blocking is working just fine.

You can also try <Blocking blocks="10" />, and you'll probably see that the execution time becomes longer.
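The folder check described above can be scripted. The following is a rough Python sketch, assuming the cache layout is exactly as described ({user_home}/.silk/entityCache/{your_interlink_id}/ with "source" and "target" sub-folders); the function name count_blocks and the placeholder interlink id are hypothetical.

```python
from pathlib import Path


def count_blocks(cache_dir: Path) -> dict:
    """Count the sub-folders inside the 'source' and 'target' folders of one interlink's cache."""
    counts = {}
    for side in ("source", "target"):
        side_dir = cache_dir / side
        if side_dir.is_dir():
            counts[side] = sum(1 for entry in side_dir.iterdir() if entry.is_dir())
        else:
            counts[side] = 0  # the folder is missing, so there are no blocks to count
    return counts


if __name__ == "__main__":
    interlink_id = "your_interlink_id"  # placeholder: replace with the id of your Interlink
    cache_dir = Path.home() / ".silk" / "entityCache" / interlink_id
    print(count_blocks(cache_dir))
```

If both counts match the blocks value in the configuration (100 in the example above), the cache was written with blocking in effect.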

skarampatakis commented 8 years ago

I have tried running Silk Single Machine on different operating systems. On Ubuntu 15.04 blocking seems to be working, but it gives bad results: a low number of links. On Windows 10 Silk seems to ignore the blocking setting, giving better results for our task. While this sounds like a bug, at least we found out that blocking isn't helping, so for now we will not be using it. The question is: how do I disable blocking? It seems that if I comment out the Blocking statement on Ubuntu, Silk ignores that and enables blocking by default, as you mentioned. If we declare enabled="false", Silk gives no results.
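For readers unfamiliar with the trade-off, the sketch below shows in general terms how a blocking scheme reduces the number of comparisons and how a poorly suited blocking key can also drop candidate pairs. It is a toy illustration only: the key (first character of the lower-cased label) and all names are hypothetical, it does not describe how Silk builds its blocks, and it is not a diagnosis of the difference between the two operating systems.

```python
from collections import defaultdict
from itertools import product


def block_key(label: str) -> str:
    # Hypothetical blocking key: first character of the lower-cased label.
    return label.lower()[:1]


def candidate_pairs(source, target, use_blocking):
    """Return the set of (source, target) pairs that would be compared."""
    if not use_blocking:
        return set(product(source, target))  # full cross product
    blocks = defaultdict(lambda: ([], []))
    for s in source:
        blocks[block_key(s)][0].append(s)
    for t in target:
        blocks[block_key(t)][1].append(t)
    pairs = set()
    for src, tgt in blocks.values():
        pairs.update(product(src, tgt))  # compare only within the same block
    return pairs


source = ["Economy", "Health", "Transport"]
target = ["economy", "Healthcare", "Public transport"]

print(len(candidate_pairs(source, target, use_blocking=False)))  # all pairs
print(len(candidate_pairs(source, target, use_blocking=True)))   # only pairs sharing a block
print(("Transport", "Public transport") in candidate_pairs(source, target, use_blocking=True))
```

In this toy data the pair "Transport" / "Public transport" lands in different blocks and is never compared, which is the kind of loss a blocking key can introduce if it does not match the similarity measure.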

afeliachi commented 8 years ago

This seems really odd. Normally enabled="false" would do the job. Can you post your whole script, please? It may help in understanding the bug. Meanwhile, I hope the project members can give a better answer to your concern. I am mainly just a user like you :)

skarampatakis commented 8 years ago

Thank you for your time.

<?xml version="1.0"?>
<Silk>
  <Prefixes>
    <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#"/>
    <Prefix id="xsd" namespace="http://www.w3.org/2001/XMLSchema#"/>
    <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#"/>
    <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
    <Prefix id="sesame" namespace="http://www.openrdf.org/schema/sesame#"/>
    <Prefix id="fn" namespace="http://www.w3.org/2005/xpath-functions#"/>
    <Prefix id="skos" namespace="http://www.w3.org/2004/02/skos/core#"/>
  </Prefixes>
  <DataSources>
    <DataSource id="codelist1" type="file">
      <Param name="file" value="source.rdf"/>
      <Param name="format" value="RDF/XML"/>
    </DataSource>
    <DataSource id="codelist2" type="file">
      <Param name="file" value="target.rdf"/>
      <Param name="format" value="RDF/XML"/>
    </DataSource>
  </DataSources>

  <Blocking enabled="false"  />

  <Interlinks>
    <Interlink id="labels">
      <SourceDataset dataSource="codelist1" var="a">
        <RestrictTo></RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="codelist2" var="b">
        <RestrictTo></RestrictTo>
      </TargetDataset>
      <LinkageRule linkType="skos:closeMatch">
        <Aggregate type="max">
          <Compare metric="levenshtein" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>            
          <Compare metric="jaro" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="jaroWinkler" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="jaccard" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="dice" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="softjaccard" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
        </Aggregate>
        <Filter limit="10"/>
      </LinkageRule>
    </Interlink>
  </Interlinks>
  <Outputs>
    <Output id="suggestions" type="file" minConfidence="0.5">
      <Param name="file" value="top10_project5.nt"/>
      <Param name="format" value="N-TRIPLE"/>
    </Output>
    <Output id="exactMatch" type="file" minConfidence="1">
      <Param name="file" value="exact_project5.nt"/>
      <Param name="format" value="N-TRIPLE"/>
    </Output>
    <Output id="score" type="alignment" minConfidence="0.5" maxConfidence="1">
      <Param name="file" value="score_project5.rdf"/>
      <Param name="format" value="RDF/XML"/>
    </Output>
  </Outputs>
</Silk>

skarampatakis commented 8 years ago

If we set enabled="false" on Ubuntu, Java throws an error: /home/user/.silk/entityCache/labels/source/block0/parition0 (No such file or directory) - error loading resources