thunlp / Few-NERD

Code and data of ACL 2021 paper "Few-NERD: A Few-shot Named Entity Recognition Dataset"
https://ningding97.github.io/fewnerd
Apache License 2.0

Regarding Data-set in inter/intra folder #41

Closed by pratikchhapolika 2 years ago

pratikchhapolika commented 2 years ago

Inside data/episode-data/inter/ I see a lot of training, test, and dev data. I may be asking a few silly questions; please pardon me.

I was exploring train_5_5.jsonl. What does train_5_5.jsonl signify? Does it have anything to do with the support and query sets?

Here is one example:

I see the support set has 14 sentences and the query set has 16 sentences.

So is the example below, with its support and query sets, fed into the model as a single example? Is the support set used to train the model? If so, what is the query set used for? I am seeing this training data structure for the first time. Could you explain in layman's terms how training happens?

{
  "support":
          {"word":
                  [
                    ["averostra", ",", "or", "``", "bird", "snouts", "''", ",", "is", "a", "clade", "that", "includes", "most", "theropod", "dinosaurs", "that", "have", "a", "promaxillary", "fenestra", "(", "``", "fenestra", "promaxillaris", "``", ")", ",", "an", "extra", "opening", "in", "the", "front", "outer", "side", "of", "the", "maxilla", ",", "the", "bone", "that", "makes", "up", "the", "upper", "jaw", "."],
                    ["since", "that", "time", ",", "the", "squadron", "made", "several", "extended", "indian", "ocean", ",", "mediterranean", "sea", ",", "and", "north", "atlantic", "deployments", "as", "part", "of", "cvw-1", "/", "cv-66", ",", "until", "the", "decommissioning", "of", "uss", "``", "america", "''", "in", "1996", "."],
                    ["the", "alpha-gal", "allergy", "is", "believed", "to", "result", "from", "tick", "bites", "."],
                    ["interaction", "was", "shown", "to", "occur", "with", "the", "dna", "-directed", "rna", "polymerase", "ii", "subunit", ",", "rpb1", ",", "of", "rna", "polymerase", "ii", "during", "both", "mitosis", "and", "interphase", "."],
                    ["he", "is", "also", "responsible", "for", "programming", "on", "diablo", "ii", ",", "the", "development", "of", "the", "battle.net", "game", "server", "network", ",", "and", "the", "quake", "2", "mod", "loki", "'s", "minions", "capture", "the", "flag", "."],
                    ["minix", "was", "first", "released", "in", "1987", ",", "with", "its", "complete", "source", "code", "made", "available", "to", "universities", "for", "study", "in", "courses", "and", "research", "."],
                    ["terminal", "island", "is", "a", "low", "snow-covered", "island", "off", "the", "north", "tip", "of", "alexander", "island", ",", "in", "the", "bellingshausen", "sea", "west", "of", "palmer", "land", ",", "antarctic", "peninsula", "."],
                    ["among", "these", "were", "net/one", ",", "3+", ",", "banyan", "vines", "and", "novell", "'s", "ipx", "/", "spx", "."],
                    ["in", "1933\u20131970", ",", "a", "summer", "camp", "on", "south", "bass", "island", "operated", "for", "episcopal", "and", "anglican", "choristers", "."],
                    ["she", "is", "also", "the", "only", "cam", "ship", "whose", "fighter", "pilot", "died", "in", "action", "after", "his", "aircraft", "was", "launched", "from", "the", "ship", "."],
                    ["the", "department", "of", "social", "welfare", "and", "development", "(", "dswd", ")", "has", "distributed", "relief", "goods", "to", "residents", "of", "boracay", "while", "the", "island", "is", "closed", "to", "tourists", "."],
                    ["``", "rainbow", "``", "was", "scrapped", "in", "1940", "."],
                    ["it", "is", "the", "leading", "firm", "for", "the", "charlotte", "douglas", "international", "airport", "airfield", "expansion", ",", "the", "new", "dallas", "fort", "worth", "international", "airport", "southwest", "end-around", "taxiway", ",", "and", "master", "plan", "updates", "at", "philadelphia", "international", "airport", "and", "san", "antonio", "international", "airport", "."],
                    ["the", "event", "held", "at", "solberg-hunterdon", "airport", "is", "the", "largest", "summertime", "hot", "air", "balloon", "festival", "in", "north", "america", "."]
                  ],

          "label":
                  [
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "other-biologything", "other-biologything", "O", "O", "other-biologything", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "product-ship", "O", "product-ship", "O", "O", "O", "O", "O", "product-ship", "product-ship", "product-ship", "product-ship", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "other-biologything", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "other-biologything", "O", "other-biologything", "other-biologything", "other-biologything", "O", "O", "other-biologything", "O", "O", "other-biologything", "other-biologything", "other-biologything", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "O"],
                    ["O", "O", "O", "product-software", "O", "product-software", "O", "product-software", "product-software", "O", "product-software", "product-software", "product-software", "O", "product-software", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "location-island", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "product-ship", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "product-ship", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "O", "building-airport", "building-airport", "building-airport", "building-airport", "O"],
                    ["O", "O", "O", "O", "building-airport", "building-airport", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
                  ]
              },

  "query":
          {"word":
                  [
                    ["the", "final", "significant", "change", "in", "the", "life", "of", "the", "coco", "2", "(", "models", "26-3134b", ",", "26-3136b", ",", "and", "26-3127b", ";", "16", "kb", "standard", ",", "16", "kb", "extended", ",", "and", "64", "kb", "extended", "respectively", ")", "was", "to", "use", "the", "enhanced", "vdg", ",", "the", "mc6847t1", ",", "allowing", "lowercase", "characters", "and", "changing", "the", "text", "screen", "border", "color", "."],
                    ["the", "reno-tahoe", "international", "airport", "reno-tahoe", "international", "airport", "(", "formerly", "known", "as", "the", "reno", "cannon", "international", "airport", ")", "is", "the", "other", "major", "airport", "in", "the", "state", "."],
                    ["it", "was", "built", "by", "cole", "palen", "for", "flight", "in", "his", "weekend", "airshows", "as", "early", "as", "1967", "and", "actively", "flown", "(", "mostly", "by", "cole", "palen", ")", "within", "the", "weekend", "airshows", "at", "old", "rhinebeck", "until", "the", "late", "1980s", "."],
                    ["lambert", "land", "is", "bounded", "in", "the", "north", "by", "the", "nioghalvfjerd", "fjord", ",", "in", "the", "east", "by", "the", "greenland", "sea", "and", "in", "the", "south", "by", "the", "zachariae", "isstrom", "."],
                    ["started", "police", "operations", "with", "4", "cessna", "cu", "206g", "officially", "on", "7", "april", "1980", "with", "operations", "focused", "in", "peninsula", "of", "malaysi", "a", "."],
                    ["mysore", "airport", "is", "away", ",", "followed", "by", "kozhikode", "international", "airport", "at", "and", "bengaluru", "international", "airport", "at", "."],
                    ["the", "egg-shaped", "qaqaarissorsuaq", "island", "is", "located", "in", "tasiusaq", "bay", ",", "in", "the", "central", "part", "of", "upernavik", "archipelago", "."],
                    ["where", "they", "inserted", "nife", "hydrogenase", "into", "polypyrrole", "films", "and", "to", "provide", "proper", "contact", "to", "the", "electrode", ",", "there", "were", "redox", "mediators", "entrapped", "into", "the", "film", "."],
                    ["the", "nt-3", "protein", "is", "found", "within", "the", "thymus", ",", "spleen", ",", "intestinal", "epithelium", "but", "its", "role", "in", "the", "function", "of", "each", "organ", "is", "still", "unknown", "."],
                    ["ted", "insists", "that", "he", "will", "have", "a", "better", "chance", "at", "winning", "since", "the", "guest", "judge", ",", "tv", "presenter", "henry", "sellers", ",", "is", "staying", "at", "the", "craggy", "island", "parochial", "house", "."],
                    ["mdm2", "binds", "and", "ubiquitinates", "p53", ",", "facilitating", "it", "for", "degradation", "."],
                    ["neuraminidase", "inhibitors", "for", "human", "neuraminidase", "(", "hneu", ")", "have", "the", "potential", "to", "be", "useful", "drugs", "as", "the", "enzyme", "plays", "a", "role", "in", "several", "signaling", "pathways", "in", "cells", "and", "is", "implicated", "in", "diseases", "such", "as", "diabetes", "and", "cancer", "."],
                    ["at", "it", "was", "long", "enough", "to", "accommodate", "the", "belle", "steamers", "that", "carried", "trippers", "along", "the", "coast", "at", "that", "time", "."],
                    ["these", "guerrilla", "sub", "missions", "originated", "at", "brisbane", "'s", ",", "capricorn", "wharf", "or", "mios", "woendi", "."],
                    ["because", "it", "was", "originally", "an", "island", "well", "within", "lake", "texcoco", ",", "iztacalco", "was", "settled", "by", "humans", "later", "than", "the", "rest", "of", "the", "valley", "of", "mexico", "."],
                    ["the", "nordic", "countries", "had", "developed", "the", "skerry", "cruiser", "classes", "and", "the", "international", "rule", "classes", "had", "adopted", "in", "1919", "a", "new", "edition", "of", "the", "rule", "which", "was", "not", "yet", "implemented", "in", "the", "countries", "."]
                  ],

          "label":
                  [
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "product-software", "product-software", "O", "O", "product-software", "O", "product-software", "O", "O", "product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "product-software", "O", "O", "product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "building-airport", "building-airport", "O", "O", "O", "O", "O"],
                    ["location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "location-island", "O", "O"],
                    ["building-airport", "building-airport", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "O", "O", "building-airport", "building-airport", "building-airport", "O", "O"],
                    ["O", "O", "location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "other-biologything", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "O", "O", "O"],
                    ["other-biologything", "O", "O", "O", "other-biologything", "O", "O", "O", "O", "O", "O"],
                    ["other-biologything", "other-biologything", "other-biologything", "other-biologything", "other-biologything", "O", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "product-ship", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "product-ship", "product-ship", "O", "product-ship", "product-ship", "O", "product-ship", "product-ship", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "product-ship", "product-ship", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
                  ]
              },

  "types": ["other-biologything", "building-airport", "location-island", "product-ship", "product-software"]}

yulinchen99 commented 2 years ago

The goal of few-shot learning is to learn new classes from only a few instances of each class. The N-way K-shot setting mimics this in an episodic manner: each episode, composed of one support set and one query set, can be treated as a single few-shot learning task. For each episode, the model learns from the support set and predicts on the query set. The support set labels are always available (because the model must learn from them), whereas the query set labels are available only during training, where they are used for supervision and loss calculation.
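To make the episode idea concrete, here is a minimal sketch of one episode using a prototype-based classifier. This is one common few-shot approach and is not necessarily how the models in this repository are implemented. The toy 2-D vectors stand in for encoder outputs, and the function names (`prototypes`, `predict`, `query_loss`) are made up for illustration.

```python
import math
from collections import defaultdict


def prototypes(support_vecs, support_labels):
    """Average the support token vectors of each label into one prototype."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(support_vecs, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(protos, vec):
    """Return the label whose prototype is closest to the token vector."""
    return min(protos, key=lambda lab: sq_dist(protos[lab], vec))


def query_loss(protos, query_vecs, query_labels):
    """Mean cross-entropy over query tokens (softmax over negative distances)."""
    total = 0.0
    for vec, gold in zip(query_vecs, query_labels):
        logits = {lab: -sq_dist(proto, vec) for lab, proto in protos.items()}
        top = max(logits.values())
        log_z = top + math.log(sum(math.exp(x - top) for x in logits.values()))
        total += log_z - logits[gold]
    return total / len(query_vecs)


# Toy token vectors (assumption: a real model would get these from an encoder).
support_vecs = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_labels = ["O", "O", "location-island", "location-island"]
query_vecs = [[0.1, 0.1], [0.9, 1.0]]
query_labels = ["O", "location-island"]

protos = prototypes(support_vecs, support_labels)          # learn from support
predictions = [predict(protos, v) for v in query_vecs]     # predict on query
loss = query_loss(protos, query_vecs, query_labels)        # supervise with query labels
```

In training, a loss like this on the query tokens would be backpropagated into the encoder. At test time only the predictions are needed, and any query labels serve purely for scoring.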

train_5_5.jsonl contains training data for the 5-way 5-shot setting: each episode has 5 classes, and each class has 5 to 10 instances in the support set.
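A rough way to check these numbers on an episode is to count entity mentions per type in the support set. The sketch below assumes each line of the file is one JSON episode shaped like the example earlier in this thread. `count_mentions` and `describe_episode` are hypothetical helper names, and the toy episode is hand-made rather than taken from the dataset.

```python
import json
from collections import Counter
from itertools import groupby


def count_mentions(label_rows):
    """Count entity mentions per type; a run of identical non-"O" tags is one mention."""
    counts = Counter()
    for row in label_rows:
        for tag, _run in groupby(row):
            if tag != "O":
                counts[tag] += 1
    return dict(counts)


def describe_episode(line):
    """Parse one JSONL line and summarise its support and query sets."""
    episode = json.loads(line)
    return {
        "n_way": len(episode["types"]),
        "support_sentences": len(episode["support"]["word"]),
        "query_sentences": len(episode["query"]["word"]),
        "support_mentions": count_mentions(episode["support"]["label"]),
    }


# Hand-made toy episode in the same shape as the thread's example (not real data).
toy_line = json.dumps({
    "support": {
        "word": [
            ["terminal", "island", "is", "near", "alexander", "island", "."],
            ["rainbow", "was", "scrapped", "."],
        ],
        "label": [
            ["location-island", "location-island", "O", "O",
             "location-island", "location-island", "O"],
            ["product-ship", "O", "O", "O"],
        ],
    },
    "query": {
        "word": [["lambert", "land", "is", "cold", "."]],
        "label": [["location-island", "location-island", "O", "O", "O"]],
    },
    "types": ["location-island", "product-ship"],
})
summary = describe_episode(toy_line)

# On the real file, something like this should work (path taken from the question):
# with open("data/episode-data/inter/train_5_5.jsonl") as f:
#     first = describe_episode(next(f))
```

Because the tags carry no B-/I- prefix, two directly adjacent mentions of the same type are counted as one here, so the counts are a lower bound.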

For more details on the sampling and training settings, see the paper: https://arxiv.org/pdf/2105.07464.pdf

pratikchhapolika commented 2 years ago

Thank you for the quick response.

The test data also has support and query sets, and both have labels.

But in a real scenario we get only a single test instance, say a sentence like "I want to return this damaged product." How should I pass it to the trained model in that case?

@cyl628

pratikchhapolika commented 2 years ago

Also, could you upload an inference script that we can run on our own dataset and that returns the predicted tags along with metrics?

yulinchen99 commented 2 years ago

As for testing, please refer to issue #42.
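For the single-sentence question raised earlier in the thread, one possible approach is to wrap your own labelled support sentences and the unlabelled sentence into the same episode format. This is only a sketch under assumptions: `build_inference_episode` is a hypothetical helper, `product-item` is an invented label, the query labels are "O" placeholders, and whether the repository's evaluation code accepts such placeholders has not been verified here.

```python
import json


def build_inference_episode(support_words, support_labels, query_tokens):
    """Wrap labelled support sentences and one unlabelled sentence as an episode dict."""
    if len(support_words) != len(support_labels):
        raise ValueError("support words and labels must have the same number of sentences")
    for words, labels in zip(support_words, support_labels):
        if len(words) != len(labels):
            raise ValueError("each support sentence needs one label per token")
    # Entity types are whatever non-"O" tags appear in the support set.
    types = sorted({tag for row in support_labels for tag in row if tag != "O"})
    return {
        "support": {"word": support_words, "label": support_labels},
        # Placeholder labels: the true tags of the query sentence are unknown.
        "query": {"word": [query_tokens], "label": [["O"] * len(query_tokens)]},
        "types": types,
    }


# Hypothetical support example; "product-item" is an invented label for illustration.
support_words = [["i", "want", "a", "refund", "for", "this", "broken", "kettle", "."]]
support_labels = [["O", "O", "O", "O", "O", "O", "O", "product-item", "O"]]

# The thread's example sentence, lowercased and whitespace-tokenised like the dataset.
query_tokens = "i want to return this damaged product .".split()

episode = build_inference_episode(support_words, support_labels, query_tokens)
jsonl_line = json.dumps(episode)  # one line of a .jsonl file
```

Metrics computed against the placeholder labels would be meaningless, so only the predicted tags should be read from such an episode.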