stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Are evaluation results printed at the end for dev sets or test sets? #1247

Closed: toufiglu closed this issue 1 year ago

toufiglu commented 1 year ago

Hi! I understand that the F1 scores for each dependency below are for the development data, but are the evaluation results printed at the end for the test set, or are they still for the dev set? Maybe I should use --score_test when training to get the test score?

2023-05-14 19:58:06 INFO: Running parser in predict mode
2023-05-14 19:58:06 INFO: Loading model from: saved_models/depparse/fr_goldgsd_parser.pt
2023-05-14 19:58:07 DEBUG: Loaded pretrain from /Users/yiminglu/stanza_resources/fr/pretrain/gsd.pt
2023-05-14 19:58:07 INFO: Loading data with batch size 5000...
2023-05-14 19:58:10 DEBUG: 8 batches created.
2023-05-14 19:58:10 INFO: Start evaluation...
2023-05-14 19:58:45 INFO: F1 scores for each dependency:
  Note that unlabeled attachment errors hurt the labeled attachment scores
         acl: p 0.0000 r 0.0000 f1 0.0000 (447 actual)
   acl:relcl: p 0.0000 r 0.0000 f1 0.0000 (293 actual)
       advcl: p 0.0000 r 0.0000 f1 0.0000 (308 actual)
 advcl:cleft: p 0.0000 r 0.0000 f1 0.0000 (23 actual)
      advmod: p 0.2787 r 0.0272 f1 0.0496 (1250 actual)
        amod: p 0.3534 r 0.5379 f1 0.4266 (1911 actual)
       appos: p 0.0000 r 0.0000 f1 0.0000 (592 actual)
         aux: p 0.0000 r 0.0000 f1 0.0000 (67 actual)
    aux:caus: p 0.0000 r 0.0000 f1 0.0000 (20 actual)
    aux:pass: p 0.0000 r 0.0000 f1 0.0000 (291 actual)
   aux:tense: p 0.0000 r 0.0000 f1 0.0000 (337 actual)
        case: p 0.6563 r 0.9080 f1 0.7619 (5345 actual)
          cc: p 0.0000 r 0.0000 f1 0.0000 (990 actual)
       ccomp: p 0.0000 r 0.0000 f1 0.0000 (118 actual)
        conj: p 0.0000 r 0.0000 f1 0.0000 (1235 actual)
         cop: p 0.3501 r 0.5105 f1 0.4153 (478 actual)
       csubj: p 0.0000 r 0.0000 f1 0.0000 (21 actual)
  csubj:pass: p 0.0000 r 0.0000 f1 0.0000 (2 actual)
         dep: p 0.0000 r 0.0000 f1 0.0000 (5 actual)
    dep:comp: p 0.0000 r 0.0000 f1 0.0000 (1 actual)
         det: p 0.9567 r 0.9614 f1 0.9591 (5469 actual)
   discourse: p 0.0000 r 0.0000 f1 0.0000 (74 actual)
  dislocated: p 0.0000 r 0.0000 f1 0.0000 (12 actual)
        expl: p 0.0000 r 0.0000 f1 0.0000 (14 actual)
   expl:pass: p 0.0000 r 0.0000 f1 0.0000 (55 actual)
     expl:pv: p 0.0000 r 0.0000 f1 0.0000 (102 actual)
   expl:subj: p 0.0000 r 0.0000 f1 0.0000 (86 actual)
       fixed: p 0.0000 r 0.0000 f1 0.0000 (342 actual)
flat:foreign: p 0.0000 r 0.0000 f1 0.0000 (71 actual)
   flat:name: p 0.0270 r 0.0015 f1 0.0028 (677 actual)
    goeswith: p 0.0000 r 0.0000 f1 0.0000 (8 actual)
        iobj: p 0.0000 r 0.0000 f1 0.0000 (113 actual)
  iobj:agent: p 0.0000 r 0.0000 f1 0.0000 (1 actual)
        mark: p 0.1354 r 0.0213 f1 0.0368 (611 actual)
        nmod: p 0.3715 r 0.6471 f1 0.4720 (3307 actual)
       nsubj: p 0.2175 r 0.5246 f1 0.3075 (1748 actual)
  nsubj:caus: p 0.0000 r 0.0000 f1 0.0000 (9 actual)
  nsubj:pass: p 0.0000 r 0.0000 f1 0.0000 (320 actual)
      nummod: p 0.0000 r 0.0000 f1 0.0000 (332 actual)
         obj: p 0.0000 r 0.0000 f1 0.0000 (1186 actual)
   obj:agent: p 0.0000 r 0.0000 f1 0.0000 (9 actual)
     obj:lvc: p 0.0000 r 0.0000 f1 0.0000 (46 actual)
         obl: p 0.0000 r 0.0000 f1 0.0000 (91 actual)
   obl:agent: p 0.0000 r 0.0000 f1 0.0000 (137 actual)
     obl:arg: p 0.0000 r 0.0000 f1 0.0000 (765 actual)
     obl:mod: p 0.1543 r 0.3585 f1 0.2158 (1431 actual)
      orphan: p 0.0000 r 0.0000 f1 0.0000 (17 actual)
   parataxis: p 0.0000 r 0.0000 f1 0.0000 (129 actual)
       punct: p 0.2146 r 0.2864 f1 0.2454 (3802 actual)
  reparandum: p 0.0000 r 0.0000 f1 0.0000 (3 actual)
        root: p 0.5971 r 0.5975 f1 0.5973 (1575 actual)
    vocative: p 0.0000 r 0.0000 f1 0.0000 (27 actual)
       xcomp: p 0.0000 r 0.0000 f1 0.0000 (401 actual)
2023-05-14 19:58:48 INFO: LAS   MLAS    BLEX
2023-05-14 19:58:48 INFO: 48.52 29.56   34.10
2023-05-14 19:58:48 INFO: Parser score:
2023-05-14 19:58:48 INFO: fr_goldgsd 48.52
2023-05-14 19:58:51 INFO: Finished running dev set on UD_French-goldgsd
  UAS   LAS  CLAS  MLAS  BLEX
67.50 48.52 34.10 29.56 34.10
AngledLuffa commented 1 year ago

That's just for the dev set. You would run with --score_test to get the test set scores. We could conceivably change that in the future...
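
For example, something along these lines (a sketch assuming the stanza-train style run scripts in stanza.utils.training and the treebank name from your log; the exact invocation depends on your setup):

    # score the saved parser on the test split instead of the dev split
    python -m stanza.utils.training.run_depparse UD_French-goldgsd --score_test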

Something seems really quite off with the model, unless that's just an early result - it is not producing anything other than a few relation types


toufiglu commented 1 year ago

Hi! Yes, I tried that, and thanks so much for replying. I think it would be a great idea to stress this point on https://github.com/stanfordnlp/stanza-train, considering that some people, like me, are new to this. Or maybe it is just me being careless.

The actual test scores are even worse than this, but I am trying cross-lingual parsing for an unrelated language, so perhaps that is fine.