gomutex / godocx

Go library for reading and writing Microsoft Docx
MIT License

Failed to parse Japanese style name #20

Closed by chr1shung 1 day ago

chr1shung commented 5 days ago

I have a use case that parses text and groups it by style name. However, I couldn't parse a style name written in Japanese characters. Here's the sample output of my test program:

STYLE &{ans-list} // English characters
TEXT: &{The students will have a school trip next week. <nil>}
STYLE &{候選答案} // Chinese characters
TEXT: &{The students will have lunch at a hotel near Ueno Zoo. <nil>}
STYLE &{a} // kanji; the original style name is '選択肢'
TEXT: &{The students will leave Ueno Zoo at two. <nil>}

I'm not sure whether it's related to encoding or something else. If you know how to fix it, I could help by submitting a PR. Thanks for the help.

gomutex commented 5 days ago

Hi,

Could you provide the following additional information?

Environment Details

Sample Code

chr1shung commented 5 days ago

Here's my local environment:

Sample Code:

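package main

import (
    "fmt"
    "log"

    "github.com/gomutex/godocx"
)

func main() {
    docx, err := godocx.OpenDocument("example.docx")
    if err != nil {
        log.Fatal(err)
    }

    // Print the style of each paragraph in the document body.
    for _, c := range docx.Document.Body.Children {
        fmt.Println("STYLE", c.Para.Property.Style)
        // Print the text of every run inside the paragraph.
        for _, c2 := range c.Para.Children {
            for _, c3 := range c2.Run.Children {
                fmt.Println("TEXT:", c3.Text)
            }
        }
    }
}

gomutex commented 4 days ago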

I created a sample docx (with python-docx) to mimic the issue and read it with the exact godocx version you mentioned.

It appears to work for me.

STYLE &{ans-list}
TEXT: &{The students will have a school trip next week. }
STYLE &{候選答案}
TEXT: &{The students will have lunch at a hotel near Ueno Zoo. }
STYLE &{選択肢}
TEXT: &{The students will leave Ueno Zoo at two. }


Sample docx generated with Python-docx

chr1shung commented 4 days ago

Would you mind testing this document? The Python package parses it correctly, while godocx cannot.

gomutex commented 4 days ago

Thank you for the input. I can see the issue. python-docx parses the style ID into a ParagraphStyle class and gets the details from styles.xml (i.e., it maps style ID 'a' to style name '選択肢'). In godocx, the style is a generic struct that contains only the style ID.
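
The ID-to-name mapping described here can be sketched with the Go standard library alone. This is a minimal illustration, not code from godocx or python-docx: the `style` and `styleSheet` types, the `styleNamesByID` function, and the sample XML are all assumptions made up for the example, and the XML only imitates the shape of a WordprocessingML styles part.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// sampleStyles is a hand-written fragment shaped like a styles part.
// It is illustrative data, not taken from the document in this issue.
const sampleStyles = `<?xml version="1.0" encoding="UTF-8"?>
<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="paragraph" w:styleId="a">
    <w:name w:val="選択肢"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="ans-list">
    <w:name w:val="ans-list"/>
  </w:style>
</w:styles>`

// style holds only the fields needed for the lookup (hypothetical type).
// Tags without a namespace match elements and attributes by local name,
// so "styleId" matches w:styleId and "name" matches w:name.
type style struct {
	ID   string `xml:"styleId,attr"`
	Name struct {
		Val string `xml:"val,attr"`
	} `xml:"name"`
}

// styleSheet collects every style element under the root (hypothetical type).
type styleSheet struct {
	Styles []style `xml:"style"`
}

// styleNamesByID parses styles XML and returns a map from style ID
// to the human-readable style name.
func styleNamesByID(data []byte) (map[string]string, error) {
	var sheet styleSheet
	if err := xml.Unmarshal(data, &sheet); err != nil {
		return nil, fmt.Errorf("parse styles: %w", err)
	}
	names := make(map[string]string, len(sheet.Styles))
	for _, s := range sheet.Styles {
		names[s.ID] = s.Name.Val
	}
	return names, nil
}

func main() {
	names, err := styleNamesByID([]byte(sampleStyles))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names["a"]) // prints 選択肢
}
```

With such a map, the style ID 'a' stored on the paragraph resolves to the name '選択肢' that the author sees in the editor.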

chr1shung commented 4 days ago

Do you think it's a bug, and would you fix it? I'm willing to help if you can point me to where I should look.

gomutex commented 3 days ago

I don't believe it's a bug; the current behavior is as intended. I can write a function that retrieves style details by style ID by parsing word/styles.xml and indexing the styles by ID. However, at the moment, I'm prioritizing basic functions and fixes in the library. I'll certainly work on this as soon as possible. Thank you for your understanding.
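
Before the styles can be indexed, the styles part has to be read out of the package. A .docx file is a ZIP archive, and the styles part is conventionally stored as word/styles.xml. The sketch below shows one way to pull a named part out of an in-memory package using only the standard library. It is not godocx code: `readPart` and `buildSamplePackage` are hypothetical helpers, and the one-entry archive built in `main` stands in for a real document.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"log"
)

// readPart returns the contents of one named entry inside a ZIP-based
// package held in memory (hypothetical helper).
func readPart(pkg []byte, name string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(pkg), int64(len(pkg)))
	if err != nil {
		return nil, fmt.Errorf("open package: %w", err)
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, fmt.Errorf("open %s: %w", name, err)
		}
		defer rc.Close()
		return io.ReadAll(rc)
	}
	return nil, fmt.Errorf("part %s not found", name)
}

// buildSamplePackage creates a tiny in-memory ZIP containing only a
// word/styles.xml entry, so the example runs without a real .docx file
// (hypothetical helper).
func buildSamplePackage(stylesXML string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/styles.xml")
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(stylesXML)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	pkg, err := buildSamplePackage("<w:styles/>")
	if err != nil {
		log.Fatal(err)
	}
	data, err := readPart(pkg, "word/styles.xml")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", data) // prints <w:styles/>
}
```

The returned bytes can then be handed to an XML decoder to build the index of styles by ID.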

gomutex commented 1 day ago

v0.1.3-beta.1 introduces a GetStyle method on paragraphs, which can be used to retrieve the style metadata.

chr1shung commented 1 day ago

> v0.1.3-beta.1 introduces a GetStyle method on paragraphs, which can be used to retrieve the style metadata.

Which styleID do I need to pass to the GetStyle(styleID string) method? I noticed that styleID isn't actually used in that method. I tried passing a random string, and it panicked:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0xe0 pc=0x10412adc4]

goroutine 1 [running]:
github.com/gomutex/godocx/docx.(*RootDoc).GetStyleByID(...)
        /Users/chris/go/pkg/mod/github.com/gomutex/godocx@v0.1.3-beta.1/docx/styles.go:9
github.com/gomutex/godocx/docx.(*Paragraph).GetStyle(0x10412ff4b?, {0x104018b1c?, 0x1400009eed8?})
        /Users/chris/go/pkg/mod/github.com/gomutex/godocx@v0.1.3-beta.1/docx/paragraph.go:209 +0x54

gomutex commented 1 day ago

Apologies. Yes, the styleID is not used and should not be there. I have also fixed the nil pointer error (in the develop branch). Can you try the develop branch and check whether there are any other bugs?
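
As a general illustration of how a lookup can avoid this kind of nil pointer panic, here is a small sketch. It is not the actual fix made in godocx: `styleInfo`, `document`, and `styleByID` are hypothetical stand-ins, and the thread does not show which pointer was nil in the library.

```go
package main

import "fmt"

// styleInfo and document are hypothetical stand-ins, not godocx types.
type styleInfo struct {
	ID, Name string
}

type document struct {
	styles map[string]*styleInfo // may be nil if no styles were loaded
}

// styleByID reports ok=false instead of panicking when the receiver is nil
// or the style is missing. Reading from a nil map is safe in Go, so only
// the receiver and the stored pointer need explicit checks.
func (d *document) styleByID(id string) (*styleInfo, bool) {
	if d == nil {
		return nil, false
	}
	s, ok := d.styles[id]
	return s, ok && s != nil
}

func main() {
	var missing *document // nil receiver, as if the document was never set
	if _, ok := missing.styleByID("a"); !ok {
		fmt.Println("no style found")
	}

	d := &document{styles: map[string]*styleInfo{"a": {ID: "a", Name: "選択肢"}}}
	if s, ok := d.styleByID("a"); ok {
		fmt.Println(s.Name)
	}
}
```

Returning a second boolean lets the caller decide how to handle a missing style rather than crashing inside the lookup.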

chr1shung commented 1 day ago

It works great:

STYLE: 解答(記号)
STYLE: ans-list
TEXT: &{The students will have a school trip next week. <nil>}
STYLE: 選択肢
TEXT: &{The students will have lunch at a hotel near Ueno Zoo. <nil>}
STYLE: 候選答案
TEXT: &{The students will leave Ueno Zoo at two. <nil>}
STYLE: ans-list

Thanks a lot for the quick response and fix!

gomutex commented 1 day ago

I have merged the fix into the main branch. You can use version v0.1.3-beta.2 (or the latest).