vagran / dxf-viewer

DXF 2D viewer written in JavaScript
Mozilla Public License 2.0
290 stars 86 forks source link

Let's use a parser other than dxf-parser. #95

Open dotoritos-kim opened 7 months ago

dotoritos-kim commented 7 months ago

The dxf-parser currently applied parses the data unreliably, which may result in missing data. And the dxf-parser repository seems to be dead.

To solve that problem, I created a better, typesafe parser and created a repository. How about trying this parser?

dxf-json repo dxf-json npm

vagran commented 7 months ago

Yes, I agree that current parser is not good. I actually realized that, when created the project, and it supposed to be a temporal solution. For that reason, it is not used from its repo, dxf-viewer has its copy which is maintained according to our needs and has significantly diverged from the original repo, so you may assume it is now unrelated to the original repo state.

For current library version, there is no so strong reasons to switch the parser. Besides estetical part, there are a couple of problems I would like to solve, but after a quick look at your parser I see, you have conceptually everything the same as in dxf-parser. First, I need truly streaming parser - it should parse stream as it is read. I see that your code accumulates all the file in one string before actually parsing it. JS typically has a limit for string size which I have hit for some DXF files. They easily can be huge, tens of gigabytes when embedded objects are used. So it is unacceptable to buffer all the file in a string and parsing it after that. Also I want streaming parser API - enities should be emitted once parsed, so that actual document is built by the parser user code (by the viewer), not by the parser itself, to generate its internal structures instead of the generic JSON. In future version of the library I want to make option for preserving parsed structure of the document for those who need, and not to waste memory for others. Another probvem is with text encoding handling. DXF file may have encoding specified in it HEADER section, so it is not known until the corresponding group is parsed. ideally, the parser should somehow handle this. So I have quite specific requirements for the parser, which I believe better be implemented by myself to keep it in line with my thoughts.

Finally, type-safety does not give any benefits until the project itself is switched to TS. All this stuff was planned for future major version, however I currently have drastical lack of free time to spend for this project. Work was actually started in v1.1.x branch, you can take a look for parsing approach used there (which is just started and work in progress, but gives some feeling).

dotoritos-kim commented 7 months ago

You want to process large files right away rather than splitting them up and leaving them in memory, right?

vagran commented 7 months ago

Yes, it should be processed as it is read, line by line.

Phryxia commented 6 months ago

@dotoritos-kim @vagran

I'm also (sometimes) contributing that parser lib, so maybe I can help. Actually I almost redesign them. I understood your requirements, and I also agree for such direction. Since current parser sometimes lookup previous token, note that it must hold some constant(O(1)) buffer, but not definitely entire files(O(n)).


But to be clear, I'd like to check the meaning of your following sentence:

Also I want streaming parser API - enities should be emitted once parsed, so that actual document is built by the parser user code (by the viewer), not by the parser itself, to generate its internal structures instead of the generic JSON.

Is this means that you want something like following:

const it = parseDxf(streamString)
let res = it.next()

while (!res.done) {
  const partialEntityOrHeaderOrWhatever = res.value
  res = it.next()
}

and you don't want like following

const entireThings = parseDxf(streamString)

right?

If it's right, then there is a one problem. DXF specification DOES have some nested context. For example, when parsing ENTITIES section, each parsing attempts generate each ENTITY, but they're also part of ENTITIES. In that case, what res.value should be? How about a single boundary path of HATCH? Should res.value be BoundaryPath? or HatchEntity?

Each user have their own schema (which can't be determined during library development), and they require different context. Therefore it's not possible to 'split' the output of the stream exactly. And most of other languages (like JavaScript, or even simple toy grammar of algebraic syntax like x + 2 * (3 - y)) also have same problem.

But following alternatives may work

group1 group2 group3
^
|
accumulated object: { header: [header1] }

group1 group2 group3
       ^
       |
accumulated object: { header: [header1, { someNestedHeaderProperty }] }
vagran commented 6 months ago

Is this means that you want something like following:

Yes, something like this.

Each user have their own schema (which can't be determined during library development), and they require different context.

Yes, that's true, the context is needed indeed, and can be, for example, provided by the user, returning context object from his entity callback, and being passed in nested entities callback invocations. The user decides what should be stored in this context, and what can be discarded. Typically, it can be its representation of the corresponding entity. Alternatively, it can be some parsing-state-related structure needed only during the parsing stage.

I would recommend you to not bother with some new parser, because it is quite a low chance that I will accept some 3rd-party implementation of the parser, because it will anyway require significant integration and testing work with no clear benefit, which I want to avoid in 1.0.x branch. Also, I do not want it to be just a TS rewrite of the same old dxf-parser, besides discussed above I have many ideas for its implementation. What I have started in 1.1.x is a generic code-first-grammar-based parser which is supposed to be very generic and most probably be released as a separate library, with a DXF parser later be built on top of it. I am trying to achieve something similar I have done very long time ago in Java, here is how the usage looks in code.

Phryxia commented 6 months ago

for example, provided by the user, returning context object from his entity callback, and being passed in nested entities callback invocations.

Well I'm not a big fan of such approach. And I'm not sure it really has significant performance difference. And receving yet another user defined schema makes difficulty extremly hard as hell. DXF specification changes merely, there is no reason to support generic structure because user still can reduce our full sized parser's result, and can throw it out from memory with respect to their own purpose. (And they can even do that for each steps, since almost every datum are independent to previous one)

I would recommend you to not bother with some new parser, because it is quite a low chance that I will accept some 3rd-party implementation of the parser, because it will anyway require significant integration and testing work with no clear benefit, which I want to avoid in 1.0.x branch

I couldn't understand what this exactly means. Do you mean to implement your own parser?

vagran commented 6 months ago

Do you mean to implement your own parser?

Yes, there is some work already started in this repository v1.1.x branch.

Phryxia commented 6 months ago

Yes, there is some work already started in this repository v1.1.x branch.

Oh I though this repository was dotoritos's one- but here it was dxf-viewer! Nevermind what I've said :p I mean my opinion was about maintaining dxf-json. Anyway thank you for lots of ideas! Though I didn't fully agree with your objectives, I think it would be great if it can be achieved. I'm looking forward to your implementation for future :)