robstewart57 / rdf4h

rdf4h is a library for working with RDF in Haskell
BSD 3-Clause "New" or "Revised" License

Help parsing large file #44

Open h4ck3rm1k3 opened 6 years ago

h4ck3rm1k3 commented 6 years ago

Hi there, I am working on parsing a large Turtle file; ideally I would like to turn it into an equivalent Haskell program. I have been profiling the read function and see memory usage, among other things, growing over time.

For 30k lines of the file I got these stats from the rdf4h-3.0.1 release from Stack.

        total alloc = 29,235,026,136 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                        SRC                                                    %time %alloc

>>=              Text.Parsec.Prim              Text/Parsec/Prim.hs:202:5-29                            17.4    7.1
satisfy          Text.Parsec.Char              Text/Parsec/Char.hs:(140,1)-(142,71)                    16.2   32.7
noneOf.\         Text.Parsec.Char              Text/Parsec/Char.hs:40:38-52                            14.3    0.0

We can see that a large amount of memory and time is spent in Parsec. I am wondering the following:

  1. Can we parse this data incrementally? Would it make sense to read it in line by line and feed that to the parser, or something similar?
  2. Can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?
  3. Will attoparsec help?

Examples of the files are here: https://gist.github.com/h4ck3rm1k3/e1b4cfa58c4dcdcfc18cecab013cc6c9

robstewart57 commented 6 years ago

Hi @h4ck3rm1k3 ,

Thanks for the report!

  3. Will attoparsec help?

If you use the git repository for this library, then you can try experimental attoparsec support provided by @axman6 in November.

Try something like:

parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"

Does that improve the memory performance?
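Spelled out as a complete program, that would be roughly the following (a sketch: it assumes the TList graph representation, and forces the parse by counting triples with triplesOf):

{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Data.RDF

main :: IO ()
main = do
  -- The experimental attoparsec-backed Turtle parser.
  result <- parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
  case result of
    Left err                 -> print err
    Right (rdf :: RDF TList) -> print (length (triplesOf rdf))  -- force the graph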

  2. Can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?

Interesting idea. What exactly would you want to convert to Haskell types? You might mean:

  1. The schema for each ontology used in a Turtle file? E.g. if the Friend of a Friend (FOAF) ontology is used, then the foaf:homepage predicate would be turned into a Haskell type? For this, have you looked at type providers? They do that sort of thing, i.e. turning a closed-world schema into types; F# has them, Haskell doesn't. (See the sketch after this list.)

  2. Turning Turtle data into types? I'm not sure how that'd work, why turning ontological instances (data as triples) into Haskell types would be a useful thing to do, or what it'd look like.
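To make option 1 concrete, a generated FOAF binding might look like this hand-written sketch (entirely hypothetical; nothing in rdf4h generates anything like it):

import Data.Text (Text)

-- Hypothetical record a FOAF type provider might emit; the comments
-- name the predicate each field would be populated from.
data Person = Person
  { name     :: Maybe Text   -- foaf:name
  , homepage :: Maybe Text   -- foaf:homepage
  , knows    :: [Person]     -- foaf:knows
  }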

I'm interested to know if attoparsec (above) gives you better results.

h4ck3rm1k3 commented 6 years ago

Yes, I have downloaded the git repo and am looking at it. I am interested in converting the data into types based on a schema or ontology I provide; for now I will create a custom one, but basically I want to call constructors of different forms based on the data in the RDF.


h4ck3rm1k3 commented 6 years ago

Thinking about this, what I would really like is some mechanism to register a function that is applied to each statement as it is read, before the file is finished. Like the SAX model in XML parsing; then I could do my processing before the whole file is consumed.
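For line-oriented N-Triples, one could approximate that today with something like the sketch below. parseString and NTriplesParser are real rdf4h functions; foreachTriple itself is hypothetical, and lines that fail to parse (blanks, comments) are simply skipped:

{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Data.RDF
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- SAX-style driver: parse each N-Triples line independently and hand
-- every resulting triple to the callback, so the full graph is never
-- built in memory.
foreachTriple :: FilePath -> (Triple -> IO ()) -> IO ()
foreachTriple path callback = do
  contents <- TLIO.readFile path             -- lazy I/O: lines stream in
  mapM_ parseLine (TL.lines contents)
  where
    parseLine line =
      case parseString NTriplesParser (TL.toStrict line) of
        Left _                   -> pure ()  -- skip blank/comment/bad lines
        Right (rdf :: RDF TList) -> mapM_ callback (triplesOf rdf)

main :: IO ()
main = foreachTriple "myfile.nt" print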

h4ck3rm1k3 commented 6 years ago

Testing normal vs. attoparsec on 30k lines: we are still hovering around 0.5 seconds per 1k lines. The memory usage has gone down, but that is still not very fast. I think next I want to look into some callback function. These are both with NTriplesParserCustom.

    Thu Sep 21 06:51 2017 Time and Allocation Profiling Report  (Final)

           gcc-haskell-exe +RTS -N -p -h -RTS

        total time  =       14.89 secs   (14886 ticks @ 1000 us, 1 processor)
        total alloc = 28,746,934,240 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                         SRC                                                    %time %alloc

satisfy          Text.Parsec.Char               Text/Parsec/Char.hs:(140,1)-(142,71)                    13.1   21.4
>>=              Text.Parsec.Prim               Text/Parsec/Prim.hs:202:5-29                            11.4   13.5
mplus            Text.Parsec.Prim               Text/Parsec/Prim.hs:289:5-34                             6.5    9.7
parsecMap.\      Text.Parsec.Prim               Text/Parsec/Prim.hs:190:7-48                             6.5   11.4
isSubDelims      Network.URI                    Network/URI.hs:355:1-38                                  4.4    0.0
fmap.\           Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(171,7)-(172,42)       4.1    3.1
isGenDelims      Network.URI                    Network/URI.hs:352:1-34                                  3.7    0.0
>>=.\.succ'      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76              3.5    1.1
encodeChar       Codec.Binary.UTF8.String       Codec/Binary/UTF8/String.hs:(50,1)-(67,25)               3.1    4.6
encodeString     Codec.Binary.UTF8.String       Codec/Binary/UTF8/String.hs:37:1-53                      2.3    4.0
concat.ts'       Data.Text                      Data/Text.hs:902:5-34                                    2.0    2.6

Testing with the latest version of rdf4h:

        Thu Sep 21 06:34 2017 Time and Allocation Profiling Report  (Final)

           gcc-haskell-exe +RTS -N -p -h -RTS

        total time  =       15.28 secs   (15282 ticks @ 1000 us, 1 processor)
        total alloc = 33,815,423,648 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                        SRC                                                    %time %alloc

satisfy          Text.Parsec.Char              Text/Parsec/Char.hs:(140,1)-(142,71)                    17.2   27.6
>>=              Text.Parsec.Prim              Text/Parsec/Prim.hs:202:5-29                            16.5   22.8
parsecMap.\      Text.Parsec.Prim              Text/Parsec/Prim.hs:190:7-48                             9.2    8.4
mplus            Text.Parsec.Prim              Text/Parsec/Prim.hs:289:5-34                             7.7    9.5
isSubDelims      Network.URI                   Network/URI.hs:355:1-38                                  3.9    0.0
isGenDelims      Network.URI                   Network/URI.hs:352:1-34                                  3.4    0.0
encodeChar       Codec.Binary.UTF8.String      Codec/Binary/UTF8/String.hs:(50,1)-(67,25)               2.9    3.9
encodeString     Codec.Binary.UTF8.String      Codec/Binary/UTF8/String.hs:37:1-53                      2.2    3.4
parserReturn.\   Text.Parsec.Prim              Text/Parsec/Prim.hs:234:7-30                             2.0    3.1
robstewart57 commented 5 years ago

Thinking about this, what I would really like is some mechanism to create a function that is applied for each statement read before the file is finished

Agreed, this would be a good feature, moving towards generating on-the-fly streams of RDF triples whilst parsing, rather than parsing a file/string in its entirety.

For example, building on the API of the io-streams library, I can imagine that to read an RDF source we'd have a new type class:

class RdfParserStream p where
  parseStringStream
      :: (Rdf a)
      => p
      -> Text
      -> Either ParseFailure (InputStream (RDF a))
  parseFileStream
      :: (Rdf a)
      => p
      -> String
      -> IO (Either ParseFailure (InputStream (RDF a)))
  parseURLStream
      :: (Rdf a)
      => p
      -> String
      -> IO (Either ParseFailure (InputStream (RDF a)))

Then these triple streams could be connected to an output stream, e.g. a file output stream, using the io-streams API:

connect :: InputStream a -> OutputStream a -> IO () 
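As a sketch of the consumer side, the existing io-streams primitives makeOutputStream and connect are already enough to drain such a stream into a per-element handler (the Int stream below is just a stand-in for a stream of parsed triples):

module Main where

import System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

-- Drain any input stream into a handler; makeOutputStream's callback
-- receives Nothing at end-of-stream, which mapM_ silently ignores.
drainWith :: (a -> IO ()) -> InputStream a -> IO ()
drainWith handler input = do
  out <- Streams.makeOutputStream (mapM_ handler)
  Streams.connect input out

main :: IO ()
main = do
  s <- Streams.fromList [1, 2, 3 :: Int]  -- stand-in for an RDF triple stream
  drainWith print s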
h4ck3rm1k3 commented 5 years ago

The big question I have for RDF and Haskell is how to create instances of types from RDF data. Is there any easy way to map RDF data via some ontology into Haskell types?

robstewart57 commented 5 years ago

@h4ck3rm1k3 sadly not, although that would be very cool.

There is some work in this area for other languages, including F# and Idris.

And also in Scala, where they have support for type providers from RDF data: https://github.com/travisbrown/type-provider-examples
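Absent type providers, a hand-written mapping is possible today with rdf4h's query function. A sketch, where the Person record is illustrative and only the FOAF name predicate is mapped:

{-# LANGUAGE OverloadedStrings #-}

import Data.RDF
import Data.Text (Text)

-- Illustrative target type; this mapping is written by hand, not
-- generated from an ontology.
data Person = Person { personIri :: Text, personName :: Maybe Text }

-- Look up foaf:name for a subject using rdf4h's query function.
lookupName :: Rdf a => RDF a -> Text -> Maybe Text
lookupName rdf subj =
  case query rdf
             (Just (unode subj))
             (Just (unode "http://xmlns.com/foaf/0.1/name"))
             Nothing of
    (Triple _ _ (LNode (PlainL n)) : _) -> Just n
    _                                   -> Nothing

mkPerson :: Rdf a => RDF a -> Text -> Person
mkPerson rdf iri = Person iri (lookupName rdf iri)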