ldbc / ldbc_snb_interactive_v1_impls

Reference implementations for LDBC Social Network Benchmark's Interactive workload.
https://ldbcouncil.org/benchmarks/snb-interactive
Apache License 2.0
101 stars 86 forks source link

Question on queries 3 and 4 #143

Closed djnakabaale closed 4 years ago

djnakabaale commented 4 years ago

This question is in reference to an earlier question/ issue #135 and I would like to follow it up with a another question.

I am experimenting with both cypher and Postgres queries on SF1 dataset, running directly in the respective servers. All queries seem to run fine and I get similar results, except for queries 3 & 4. This appears to be an issue with the endDate variable that I hope to get some guidance on, as I may be missing something. Here are some of the details in the scripts I run for query 4.

For Cypher:

Neo4j 3.5.22 Cypher-Shell 1.1.13

Both these queries have an endDate variable for which I have to convert startDate to nano seconds before adding durationDays to it.

{
        read
        while IFS='|' read -r col1 col2 col3
        do 
            # convert startDate to nanoseconds
            let toNano=$col2*1000000
            # add add duration to startDate
            let endDate=$(expr $toNano + $col3)
            { echo ":param personId=>$col1"; echo ":param startDate=>$col2"; echo ":param endDate=>$endDate"; cat ./queries/interactive-complex-4.cypher ; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
            ((i++))
        done
    } < ./ldbc_snb_datagen/substitution_parameters/interactive_4_param.txt

For Postgres:

psql (PostgreSQL) 11.9

export PG_DB_NAME=ldbc
export PG_USER=ldbcuser
export PG_PORT=5432

{
    read
    while IFS='|' read -r col1 col2 col3
    do 
        (psql -c -d $PG_DB_NAME -U $PG_USER -p $PG_PORT -f ./queries/interactive-complex-4.sql  -F '|' -A  -v personId=$col1 -v startDate="'`TZ=UTC date -u -j -r $((col2/1000)) '+%Y-%m-%d %H:%M:%S'`'" -v durationDays=$col3)
        ((i++))
    done
} < ./ldbc_snb_datagen/substitution_parameters/interactive_4_param.txt

In postgres I initially got this error ERROR: invalid input syntax for type interval, and I made a minor modification to the condition in the where clause of the nested select statement, adding timestamp to it; m_creationdate >= :startDate and m_creationdate < ((:startDate::timestamp) + INTERVAL '1 days' * :durationDays).

The results from both postgres and cypher are very different. I'll be happy to share some of the query results if that helps.

szarnyasg commented 4 years ago

@dmj2x thanks for the detailed issue.

Regrading your Cypher script, I find this bit suspicious:

            # convert startDate to nanoseconds
            let toNano=$col2*1000000
            # add add duration to startDate
            let endDate=$(expr $toNano + $col3)

For Interactive Q4, col3 is specified in days, so in order to add it to a nanosecond value, you would first need to convert it to nanoseconds by multiplying it by 864000000000.

Regarding representing dates in Neo4j, there are three key approaches:

  1. Use the datetime support introduced in 2017. This is the most convenient approach and it used in the dev branch.
  2. Use integers which encode yyyyMMddhhmmss000 (the last triple zeros representing milliseconds). This is used in the stable branch. The conversion from epochs to such values is handled by the CypherConverter class and I also have some Bash scripts for convenience.
  3. Use epochs. This is straightforward to import and to use for range selections but it gets problematic when one needs to extract fields such as day/month from the date. A potential workaround is to add those fields as separate properties.

I'll look into the Postgres code later.

djnakabaale commented 4 years ago

Hi @szarnyasg,

Regrading your Cypher script, I find this bit suspicious: For Interactive Q4, col3 is specified in days, so in order to add it to a nanosecond value, you would first need to convert it to nanoseconds by multiplying it by 864000000000.

I have updated my script to convert days to nanoseconds first,

let toNano=$col2*1000000
let daysToNano=$col3*864000000000
# add add duration to startDate
let endDate=$(( $toNano + $daysToNano ))
{ echo ":param personId=>$col1"; echo ":param startDate=>$col2"; echo ":param endDate=>$endDate"; cat ./queries/interactive-complex'-'$1.cypher ; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin

For some reason, the query results are the same even after I modified the script.

Also, the column creationdate to which startDate is compared, is in nanoseconds however when I convert startDate to nanoseconds first, I get no query results as opposed to using startDate date unchanged.

let startDate=$col2*1000000
let daysToNano=$col3*864000000000
let endDate=$(( $startDate + $daysToNano ))
{ echo ":param personId=>$col1"; echo ":param startDate=>$startDate"; echo ":param endDate=>$endDate"; cat ./queries/interactive-complex-4.cypher ; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin

I'll look into the Postgres code later.

Thank you for taking time to look into this.

szarnyasg commented 4 years ago

How do you load the data to Neo4j? In the generated files, they are not in epoch but using the String formatter:

yyyy-MM-dd'T'HH:mm:ss.SSSZ

(This is an almost RFC 3339-compliant format. A fully compliant String formatter would use e.g. yyyy-MM-dd'T'HH:mm:ss.SSS+00:00.)

djnakabaale commented 4 years ago

How do you load the data to Neo4j? In the generated files, they are not in epoch but using the String formatter:

I use neo4j-import that's available in neo4j 3.5

szarnyasg commented 4 years ago

To match the format of the timestamps created by the loader, you'll need a slightly more complicated Bash script:

EPOCH_MILLI=$col2
EPOCH_SEC=$(($EPOCH_MILLI / 1000))
export TZ=GMT
TIMESTAMP=`date +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`
djnakabaale commented 4 years ago

To match the format of the timestamps created by the loader, you'll need a slightly more complicated Bash script:

TIMESTAMP=`date +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`

I'm working on mac with bash version _GNU bash, version 3.2.57(1)-release (x8664-apple-darwin19) and it seems not to like the format, for example running

EPOCH_MILLI=1335830400000
EPOCH_SEC=$(($EPOCH_MILLI / 1000))
export TZ=GMT
TIMESTAMP=`date +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`

results into

date: illegal time format
usage: date [-jnRu] [-d dst] [-r seconds] [-t west] [-v[+|-]val[ymwdHMS]] ... 
            [-f fmt date | [[[mm]dd]HH]MM[[cc]yy][.ss]] [+format]
szarnyasg commented 4 years ago

Interesting, what's the output for you when running this?

date --version

For me it's

date (GNU coreutils) 8.32
djnakabaale commented 4 years ago

Interesting, what's the output for you when running this?

date --version

Interesting indeed, on mac

macOS Catalina
version 10.15.7

the output is


date --version

date: illegal option -- - usage: date [-jnRu] [-d dst] [-r seconds] [-t west] [-v[+|-]val[ymwdHMS]] ... [-f fmt date | [[[mm]dd]HH]MM[[cc]yy][.ss]] [+format]

szarnyasg commented 4 years ago

Try gdate.

djnakabaale commented 4 years ago

I didn't have gdate so installed version date (GNU coreutils) 8.32 with homebrew script seems to work fine and generates a time stamp.

function epoch () {
  EPOCH_MILLI=$1
  EPOCH_SEC=$(($EPOCH_MILLI / 1000))
  export TZ=GMT
  TIMESTAMP=`gdate +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`
  echo $TIMESTAMP
}

Also added another function to get the final end date

function epoch_end_date () {
  START_DATE=$1
  DAYS=$2
  DAYS_TO_MS=$(($DAYS * 86400000))
  EPOCH_MILLI=$(($START_DATE + $DAYS_TO_MS))
  EPOCH_SEC=$(($EPOCH_MILLI / 1000))
  export TZ=GMT
  TIMESTAMP=`gdate +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`
  echo $TIMESTAMP
}

All seems to work fine and I get results on some substitution params but not all. I would like to inquire on how I can verify the query results from by queries 3 & 4.

szarnyasg commented 4 years ago

I've tried it for one set of params per query. I had to add an echo ";"; do the script to get it working. This approach returns results for both queries.

$ { echo ":param personId=>21990232559429"; echo ":param startDate=>20120501000000000"; echo ":param endDate=>20120607000000000"; cat ./queries/interactive-complex-4.cypher; echo ";"; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
+------------------------------------------------+
| tagName                            | postCount |
+------------------------------------------------+
| "Hassan_II_of_Morocco"             | 2         |
| "Appeal_to_Reason"                 | 1         |
| "Principality_of_Littoral_Croatia" | 1         |
| "Rivers_of_Babylon"                | 1         |
| "Van_Morrison"                     | 1         |
+------------------------------------------------+
{ echo ":param personId=>21990232559429"; echo ":param startDate=>20120501000000000"; echo ":param endDate=>20120607000000000"; echo ":param countryXName: 'Belgium'"; echo ":param countryYName: 'Netherlands'"; cat ./queries/interactive-complex-3.cypher; echo ";"; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
+-----------------------------------------------------------------------------+
| personId       | personFirstName | personLastName | xCount | yCount | count |
+-----------------------------------------------------------------------------+
| 8796093028172  | "Karan"         | "Khan"         | 1      | 1      | 2     |
| 13194139537010 | "Rahul"         | "Rao"          | 1      | 1      | 2     |
+-----------------------------------------------------------------------------+

I'm polishing up the script and looking into how it performs for other params.

szarnyasg commented 4 years ago

Here's a script that works for me. Note that this only checks the first 4 lines of the params txt files (i.e. the first 3 parameter combinations) and uses an absolute path in my file system:

function epoch () {
  EPOCH_MILLI=$1
  EPOCH_SEC=$(($EPOCH_MILLI / 1000))
  export TZ=GMT
  TIMESTAMP=`date +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`
  echo $TIMESTAMP
}

function epoch_end_date () {
  START_DATE=$1
  DAYS=$2
  DAYS_TO_MS=$(($DAYS * 86400000))
  EPOCH_MILLI=$(($START_DATE + $DAYS_TO_MS))
  EPOCH_SEC=$(($EPOCH_MILLI / 1000))
  export TZ=GMT
  TIMESTAMP=`date +%Y%m%d%H%M%S000 --date @$EPOCH_SEC`
  echo $TIMESTAMP
}

{
    read
    while IFS='|' read -r col1 col2 col3 col4 col5
    do 
        echo $col1 $col2 $col3 $col4 $col5
        let startDate=`epoch $col2`
        let endDate=`epoch_end_date $col2 $col3`
        { echo ":param personId=>$col1"; echo ":param startDate=>$startDate"; echo ":param endDate=>$endDate"; echo ":param countryXName: '$col4'"; echo ":param countryYName: '$col5'"; cat ./queries/interactive-complex-3.cypher; echo ";"; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
        ((i++))
    done
} < <(head -n 4 ~/git/snb/ldbc_snb_datagen/substitution_parameters-sf1/interactive_3_param.txt)

{
    read
    while IFS='|' read -r col1 col2 col3
    do 
        echo $col1 $col2 $col3
        let startDate=`epoch $col2`
        let endDate=`epoch_end_date $col2 $col3`
        { echo ":param personId=>$col1"; echo ":param startDate=>$startDate"; echo ":param endDate=>$endDate"; cat ./queries/interactive-complex-4.cypher; echo ";"; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
        ((i++))
    done
} < <(head -n 4 ~/git/snb/ldbc_snb_datagen/substitution_parameters-sf1/interactive_4_param.txt)

The outputs of these commmands match with the relevant lines of validation_params.csv:

["com.ldbc.driver.workloads.ldbc.snb.interactive.LdbcQuery3",17592186055119,"Laos","Scotland",1306886400000,42,20]|[[8796093029689,"Eun-Hye","Yoon",1,1,2]]
["com.ldbc.driver.workloads.ldbc.snb.interactive.LdbcQuery3",8796093030404,"Uruguay","Scotland",1298937600000,28,20]|[]
["com.ldbc.driver.workloads.ldbc.snb.interactive.LdbcQuery3",2199023266354,"Scotland","Slovakia",1301616000000,30,20]|[]
["com.ldbc.driver.workloads.ldbc.snb.interactive.LdbcQuery4",21990232559429,1335830400000,37,10]|[["Hassan_II_of_Morocco",2],["Appeal_to_Reason",1],["Principality_of_Littoral_Croatia",1],["Rivers_of_Babylon",1],["Van_Morrison",1]]
["com.ldbc.driver.workloads.ldbc.snb.interactive.LdbcQuery4",24189255814337,1343779200000,29,10]|[["Ehud_Olmert",3],["Be-Bop-A-Lula",1],["Kingdom_of_Sardinia",1],["The_Singles:_The_First_Ten_Years",1]]
["com.ldbc.driver.workloads.ldbc.snb.interactive.LdbcQuery4",2199023261318,1343779200000,29,10]|[["Ehud_Olmert",3],["Be-Bop-A-Lula",1],["Kingdom_of_Sardinia",1],["The_Singles:_The_First_Ten_Years",1]]
djnakabaale commented 4 years ago

I've tried it for one set of params per query.

$ { echo ":param personId=>21990232559429"; echo ":param startDate=>20120501000000000"; echo ":param endDate=>20120607000000000"; cat ./queries/interactive-complex-4.cypher; echo ";"; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
+------------------------------------------------+
| tagName                            | postCount |
+------------------------------------------------+
| "Hassan_II_of_Morocco"             | 2         |
| "Appeal_to_Reason"                 | 1         |
| "Principality_of_Littoral_Croatia" | 1         |
| "Rivers_of_Babylon"                | 1         |
| "Van_Morrison"                     | 1         |
+------------------------------------------------+
{ echo ":param personId=>21990232559429"; echo ":param startDate=>20120501000000000"; echo ":param endDate=>20120607000000000"; echo ":param countryXName: 'Belgium'"; echo ":param countryYName: 'Netherlands'"; cat ./queries/interactive-complex-3.cypher; echo ";"; } | neo4j-server/bin/cypher-shell --format verbose -u neo4j -p admin
+-----------------------------------------------------------------------------+
| personId       | personFirstName | personLastName | xCount | yCount | count |
+-----------------------------------------------------------------------------+
| 8796093028172  | "Karan"         | "Khan"         | 1      | 1      | 2     |
| 13194139537010 | "Rahul"         | "Rao"          | 1      | 1      | 2     |
+-----------------------------------------------------------------------------+

Great, thanks again @szarnyasg. I have similar values when I use the parameters above. I also seem to be getting consistent results for queries 3 and 4 on scale factor one, for both cypher an postgres. I am doing more experiments with scale factors 10 and 30, and I hope to check-in back here as I get results.

szarnyasg commented 4 years ago

Great, I hope the queries work on larger SFs as well. Feel free to reopen this issue if they don't work.