Description | Data structure | Annotations | Additional information | License | Contributors
We present a manually annotated dataset for Automatic Term Extraction (ATE) from scientific abstracts pertaining to coastal areas. The corpus comprises 195 abstracts that were pre-annotated using three Knowledge Bases (KBs), AGROVOC, GEMET, and TAXREF-RD, and then revised by a human annotator. We annotated only sentences pertaining to the functioning of littoral systems: of the 1,960 sentences, 1,149 contain annotated terms. All abstracts are in English. We conduct experiments using state-of-the-art (SOTA) models for ATE.
The IOB annotations for the dataset follow the same format as ACTER. Additionally, a list of all unique terms is provided. Annotations are available for entire abstracts, individual sentences, and only sentences containing annotated terms.
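As an illustration of how the IOB files might be consumed, here is a minimal sketch in Python. It assumes the layout commonly described for ACTER's sequential annotations: one `token<TAB>label` pair per line, labels `B`, `I`, and `O`, and a blank line between sentences. Please check these assumptions against the actual `.tsv` files before relying on them. The function names `parse_iob` and `extract_terms` are hypothetical helpers, not part of the dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # [(token, label), ...]


def parse_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group `token<TAB>label` lines into sentences.

    Assumption: a blank line marks a sentence boundary.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed layout
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_terms(sentence: Sentence) -> List[str]:
    """Join each B token with the I tokens that directly follow it."""
    terms: List[str] = []
    span: List[str] = []
    for token, label in sentence:
        if label.startswith("B"):
            if span:
                terms.append(" ".join(span))
            span = [token]
        elif label.startswith("I") and span:
            span.append(token)
        else:
            if span:
                terms.append(" ".join(span))
            span = []
    if span:
        terms.append(" ".join(span))
    return terms
```

To use it on the corpus, one could open a file such as `data/annotations/sequential_annotations/iob_annotations/1985_155.tsv` with `open(path, encoding="utf-8")`, pass the file object to `parse_iob`, and call `extract_terms` on each sentence. Collecting the results into a set would give a per-file list of unique terms, comparable in spirit to `en_terms.tsv`.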
.
├── README.md
└── data
    ├── annotations
    │   ├── sequential_annotations
    │   │   ├── iob_annotations
    │   │   │   ├── 1985_155.tsv
    │   │   │   ├── 1985_336.tsv
    │   │   │   └── ...
    │   │   └── iob_annotations_sents_wo_empty
    │   │       ├── 1985_155_1.tsv
    │   │       ├── 1985_155_2.tsv
    │   │       └── ...
    │   └── unique_annotations_lists
    │       └── en_terms.tsv
    ├── sents_tokenized
    │   ├── 1985_155_0.txt
    │   ├── 1985_155_1.txt
    │   └── ...
    ├── sents_tokenized_wo_empty
    │   ├── 1985_155_1.txt
    │   ├── 1985_155_2.txt
    │   └── ...
    └── texts_tokenized
        ├── 1985_155.txt
        ├── 1985_336.txt
        └── ...

11 directories, 6610 files
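The file names in the tree suggest that sentence-level files are named `<abstract_id>_<sentence_index>` (for example, `1985_155_1.txt` would be sentence 1 of abstract `1985_155`). This convention is inferred from the listing rather than documented, so treat it as an assumption. Under that assumption, the following Python sketch groups sentence files by abstract; `split_sentence_id` and `group_by_abstract` are hypothetical helper names.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple


def split_sentence_id(stem: str) -> Tuple[str, int]:
    """Split a stem such as '1985_155_1' into ('1985_155', 1).

    Assumption: the last underscore-separated field is the sentence index.
    """
    abstract_id, _, index = stem.rpartition("_")
    return abstract_id, int(index)


def group_by_abstract(sent_dir: Path, suffix: str = ".txt") -> Dict[str, List[Path]]:
    """Map each abstract id to its sentence files, ordered by sentence index."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sent_dir.glob(f"*{suffix}"):
        abstract_id, _ = split_sentence_id(path.stem)
        groups[abstract_id].append(path)
    for paths in groups.values():
        # Sort numerically so that sentence 10 comes after sentence 2.
        paths.sort(key=lambda p: split_sentence_id(p.stem)[1])
    return dict(groups)
```

For instance, `group_by_abstract(Path("data/sents_tokenized_wo_empty"))` would return the sentences that contain annotated terms for each abstract, and passing `suffix=".tsv"` with the `iob_annotations_sents_wo_empty` directory would do the same for the matching IOB files.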
We collected more than 60,000 abstracts from Scopus, published between 1980 and 2023 and containing the terms "coastal area" or "littoral". We randomly selected 195 of them and used the Annotator tool from AgroPortal to pre-annotate terms appearing in three KBs: AGROVOC, a thesaurus on agronomy; GEMET, a general environmental thesaurus; and TAXREF-RD, a French national taxonomic register for fauna, flora, and fungi that covers mainland France and its overseas territories. We then manually annotated the abstracts with the INCEpTION annotation tool, starting from these pre-annotations. We annotated only sentences that describe the functioning of coastal areas, so most passages describing methods were left unannotated.
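The selection step described above (keyword filter, then a random draw of 195 abstracts) can be sketched as follows. This is only an illustration: the actual keyword query was run on Scopus, and the case-insensitive matching, the fixed seed, and the names `mentions_keyword` and `sample_abstracts` are assumptions introduced here, not details from the original pipeline.

```python
import random
from typing import List

KEYWORDS = ("coastal area", "littoral")


def mentions_keyword(text: str) -> bool:
    """True if the text contains any keyword (case-insensitive, an assumption)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def sample_abstracts(abstracts: List[str], k: int = 195, seed: int = 0) -> List[str]:
    """Keep abstracts that mention a keyword, then draw up to k without replacement."""
    candidates = [a for a in abstracts if mentions_keyword(a)]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(candidates, min(k, len(candidates)))
```

Fixing the seed is a design choice for reproducibility; the source does not say whether the original random selection was seeded.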
<to be added>
<to be added>