huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: grad can be implicitly created only for scalar outputs #6749

Closed aclifton314 closed 3 years ago

aclifton314 commented 4 years ago

System Info

OS: Pop!_OS 20.04
PyTorch: 1.5.1
Transformers: 3.0.2
Tokenizers: 0.8.1rc1
Python: 3.7.6
Pretrained Model: GPT2
Pretrained Tokenizer: GPT2

Question

I'm getting the following error and I'm not sure how to resolve it:

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/misc_tests.py", line 78, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 637, in _training_step
    loss.backward()
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 94, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 35, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs

Here's some sample code:

from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Tokenizer
import torch
from torch.utils.data import Dataset

class SDAbstractsDataset(Dataset):
    def __init__(self, set_type_str):
        if set_type_str == 'train':
            prompt1 = 'We present an update on the results of the Double Chooz experiment. Double Chooz searches for the neutrino mixing angle, θ13, in the three-neutrino mixing matrix via the disappearance of produced by the dual 4.27 GW/th Chooz B Reactors. Here we discuss updated oscillation fit results using both the rate and the shape of the anti-neutrino energy spectrum. In the most recent oscillation analysis we included data with neutron captures on Gadolinium and Hydrogen along with the reactor off data that we collected. This is an important step in our multi-year program to establish the value of θ13.'
            prompt2 = 'The paper covers detailed discussion on novel control system developed for adaptive fluid-based shock-absorbers serving for mitigation of unknown impact excitations. In order to provide complete independence of the control system from the loading conditions, the Hybrid Prediction Control (HPC) was elaborated. The proposed method is an extension of previously introduced kinematic feedback control which ensures optimal path finding, tracking and path update in case of high disturbance or sudden change of loading conditions. Implementation of the presented control system allows to obtain self-adaptive fluid-based absorbers providing robust impact mitigation. In contrast to previously developed methods of Adaptive Impact Absorption, the proposed control strategy does not require prior knowledge of impact excitation or its preliminary identification. The independence of applied control system from parameters of impact loading results in the capability of automatic path correction in the case of disturbance occurrence and re-adaptation to a number of subsequent impacts. The successful operation of the self-adaptive system is investigated with the use of numerical examples involving double-chamber pneumatic shock-absorber equipped with controllable valve. Efficiency of the HPC is proved by comparison with passive absorber as well as device equipped with adaptive and optimal control modules.'
            prompt3 = 'This study aimed to produce biosurfactant from Pseudozyma tsukubaensis using cassava wastewater and an inoculum (biomass) for galactooligosaccharides synthesis from lactose as an integrated system. First, the use of cassava wastewater as a low cost culture medium by P. tsukubaensis to produce biomass and biosurfactant was evaluated and optimized. Then, the microbial cells (biomass) obtained from the optimized process were used to produce galactooligosaccharides from lactose. The optimum conditions for biosurfactant and biomass synthesis were found to be 80% (v/v) of cassava wastewater at 30°C and 200rpm for 48h. The highest concentration of biosurfactant, that is, minimum surface tension value and maximum biomass concentration predicted were experimentally confirmed as 26.87mN/m and 10.5g/L, respectively. The biosurfactant obtained showed good thermal (121°C/1h), pH (2–11) and ionic strength (0–25% NaCl) stability. Excellent emulsifier activity was also verified, suggesting a potential application in enhanced oil recovery. Galactooligosaccharides synthesized by the Kluyveromyces genus have been extensively investigated, however, few studies have reported transgalactosylation ability by other yeast genera. The transgalactosylation activity of the yeast biomass at optimized conditions from 40% (w/w) lactose resulted in galactooligosaccharides production of 73.12g/L and a yield of 18.28% (w/w) at pH 8.0 and 30°C in 24h. This research showed the technical feasibility of an integrated process: biosurfactant and GOS production from P. tsukubaensis, which takes advantage of the remarkable metabolism of this microorganism. To the best of our knowledge, this is the first study reporting the potential of P. tsukubaensis to produce two economical biotechnological products of increase interest as an integrated process.'
            prompt4 = 'Advantages of a fuzzy predictive control algorithm are discussed in the paper. The fuzzy predictive algorithm is a combination of a DMC (Dynamic Matrix Control) algorithm and Takagi–Sugeno fuzzy modeling, thus it inherits advantages of both techniques. The algorithm is numerically effective. It is in fact generalization of the standard DMC algorithm widely used in the industry, thus the existing implementations of the DMC algorithm can be extended using the presented fuzzy approach. A simple and easy to apply method of fuzzy predictive control algorithms synthesis is presented in the paper. It can be easy applied also in the case of Multiple Input Multiple Output (MIMO) control plants. Moreover, information about measured disturbance can be included in the algorithms in an easy way. The advantages of the fuzzy predictive control algorithm are demonstrated in the example control systems of two nonlinear chemical reactors: the first one—with inverse response and the second one—a MIMO plant with time delay.'
            prompt5 = 'BackgroundBack injury is a common place in our society. Up to two-thirds of back injuries have been associated with trunk rotation. However, the torque production ability with a rotated spine and electromyographic activity of trunk muscles in such efforts is poorly understood. Therefore, the objectives of this study are to study torque production capacity of variously rotated and flexed trunk and to measure the EMG of selected trunk muscles in these activities.MethodsNineteen normal young subjects (7 males and 12 females) were recruited. Subjects were stabilized on a posture-stabilizing platform and were instructed to assume a flexed and right rotated posture (20°, 40° and 60° of rotation and 20°, 40° and 60° of flexion) in a random order. The subjects were asked to exert their maximal voluntary contraction in the asymmetric plane of rotation–extension for a period of 5s. The surface EMG of the external and internal obliques, rectus abdominis, latissimus dorsi, erector spinae at the 10th thoracic and 3rd lumbar vertebral levels was recorded bilaterally along with the torque generated.FindingsWhereas the torque generated was significantly affected by both rotation and extension in both genders (P<0.001), the EMG was independent of rotation but affected by flexion in females only (P<0.01). The torques produced by both genders in each of the nine postures was significantly different from each other (P<0.001). The EMG demonstrated a trend of increase with increasing rotation and flexion. The response surfaces of normalized peak EMG of the right external oblique and internal oblique was somewhat similar, indicating a rotator torque and a stabilizing effect. The left latissimus dorsi and right external oblique provided the rotational torque and the right erector spinae provided the extensor effort. Since the rotation–extension was performed in the plane of asymmetry, the effort required the recruitment of muscles involved in left rotation, stability of rotated spine and an extensor effort.InterpretationThe torque production capacity of the human trunk is posture dependent and declines with increasing rotation. However, with increasing rotation and flexion, the magnitude of EMG increases. This implies that with increasing asymmetry, it requires more muscle effort (thus tissue stress) to generate less torque. Increasing asymmetry tends to weaken the system and may enhance chances of injury.'
            prompt6 = 'Orthogonal frequency division multiplexing (OFDM) is a promising candidate for light emitting diode (LED)-based optical wireless communication (OWC); however, precise channel estimation is required for synchronization and equalization. In this work, we study and discover that the channel response of the white-lightLED-based OWC was smooth and stable. Hence we propose and demonstrate using a specific and adaptive arrangement of grid-type pilot scheme to estimate the LED OWC channel response. Experimental results show that our scheme can achieve better transmission performance and with some transmission capacity enhancement when compared with the method using training-symbol scheme (also called block-type pilot scheme).'
            prompt7 = 'The catalytic activities of three nano-sized nickel catalysts Ni/Y2O3, Ni/La2O3 and Ni/Al2O3, using nickel oxalate as precursor and by impregnation–decomposition–reduced method, have been investigated for the reactions of steam reforming of ethanol at low temperature. Properties of structure and surface of catalysts were tested by XRD, XPS, XES, SEM and BET area. The initial reaction kinetics of ethanol over the catalysts was studied by steady-state reaction and a first-order reaction with respect to ethanol was found. It is found that the catalysts Ni/Y2O3 and Ni/La2O3 exhibit relative high activity for ethanol steam reforming at 250∘C with a conversion of ethanol of 81.9% and 80.7%, and a selectivity of hydrogen of 43.1% and 49.5%, respectively. When temperature reached 320∘C, the conversion of ethanol increased to 93.1% and 99.5% and the selectivity of hydrogen was 53.2% and 48.5%, respectively. The catalyst Ni/Al2O3 exhibits relative lower activity for ethanol steam reforming and hydrogen selectivity. However, the three catalysts all have long-term stability for ethanol steam reforming.'
            prompt8 = 'Exergetic and exergoeconomic analyses are often used to evaluate the performance of energy systems from the thermodynamic and economic points of view. While a conventional exergetic analysis can be used to recognize the sources of inefficiencies, the so-called advanced exergy-based analysis is convenient for identifying the real potential for thermodynamic improvements and the system component interactions by splitting the exergy destruction and the total operating cost within each component into endogenous/exogenous and unavoidable/avoidable parts. In this study for the first time an advanced exergoeconomic analysis is applied to a gas-engine-driven heat pump (GEHP) drying system used in food drying for evaluating its performance along with each component. The advanced exergoeconomic analysis shows that the unavoidable part of the exergy destruction cost rate within the components of the system is lower than the avoidable part. The most important components based on the total avoidable costs are drying ducts, the condenser and the expansion valve. The inefficiencies within the condenser could particularly be improved by structural improvements of the whole system and the remaining system components. Finally, it can be concluded that the internal design changes play a more essential role in determining the cost of each component.'
            self.data_list = [prompt1, prompt2, prompt3, prompt4, prompt5, prompt6, prompt7, prompt8]

        else:
            prompt1 = 'State estimation (SE) is well-established at the transmission system level of the electricity grid, where it has been in use for the last few decades and is a most vital component of energy management systems employed in the monitoring and control centers of electric transmission systems. However, its use for the monitoring and control of power distribution systems (DSs) has not yet been widely implemented because DSs have been majorly passive with uni-directional power flows. This scenario is now changing with the advent of smart grid, which is changing the nature of electric distribution networks by embracing more dispersed generation, demand responsive loads, and measurements devices with different data rates. Thus, the development of distribution system state estimation (DSSE) tool is inevitable for the implementation of protection, optimization, and control techniques, and various other features envisioned by the smart grid concept. Due to the inherent characteristics of DS different from those of transmission systems, transmission system state estimation (TSSE) is not applicable directly to DSs. This paper is an attempt to present the state-of-the-art on DSSE as an enabler function for smart grid features. It broadly reviews the development of DSSE, challenges faced by its development, and various DSSE algorithms. Additionally, it identifies some future research lines for DSSE.'
            prompt2 = 'A solar-assisted absorption heat transformer (SAAHT) is a useful substitute for the conventional equipment to generate low-pressure steam. A 100kW steam generation system with a SAAHT in Langfang (China) is evaluated in this study. Hourly thermodynamic performance, including system efficiency, exergy efficiency, CO2 emission reduction rate and output heat in typical days in four seasons, is discussed. Results show that ambient temperature has a smaller effect on system performance than solar irradiation. In any one of the typical days in spring, summer and autumn, the system presents higher output heat and CO2 emission reduction rate, more stable system efficiency and exergy efficiency than those in winter. Comparative results from two methods show that ratio method has higher system efficiency with solar irradiation below 600W/m2. A hybrid method combining both the degree method and ratio method is adopted to work with the off-design condition, and results show that performance improvement for system is not so obvious as that in solo absorption heat transformer. Among the four typical days, the most obvious improvement occurs in summer with cumulative output heat increasing from 1318kWh to 1343kWh, and the CO2 emission reduction increasing from 296kg to 301kg.'
            prompt3 = 'The European Commission is encouraging the Cement, Lime and Magnesium Oxide Manufacturing Industries to reutilize collected particulate matter or wastes in the emission control of SO2 with a 100% removal efficiency. Following this directive, three different by-products from the calcination of natural magnesite were selected in order to evaluate their desulfurization capacity. The saturation time, defined as the time for the total neutralization of SO2 was used to determine consumption values at laboratory scale with 100% removal efficiency. The by-product LG-MgO (∼68% MgO) presented the lowest consumption value, with 2.9kg per m3 of SO2, three times the corresponding to the widely used high grade Ca(OH)2. The liquid-to-gas (L/G) ratio was used for comparison to the industry and taking this into account, the final pH range before achieving saturation was 5.1–6.3. The residual solids obtained at the end of the process were mainly composed of unreacted magnesium and calcium compounds and reaction products CaSO4·2H2O and MgSO4·6H2O which can be used as fertilizers. Therefore, the reutilization of these by-products in a wet flue gas desulfurization process is a feasible and sustainable choice that allows extending their life-cycle.'
            prompt4 = 'The GP Joule Group is starting the biggest ‘green’ hydrogen mobility project in Germany so far, close to the northern border with Denmark. The eFarm project will create a modular hydrogen infrastructure, covering production and processing through to utilisation in a number of hydrogen-powered vehicles.'
            self.data_list = [prompt1, prompt2, prompt3, prompt4]

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.data_list[idx]
        return abstract_text

def sd_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = None
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = None
    batch['use_cache'] = True
    return batch

output_dir = 'TMP_DIR'
logging_dir = 'TMP_DIR'
training_args = TrainingArguments(
    output_dir=output_dir,
    logging_dir=logging_dir,
    do_train=True,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

model = GPT2LMHeadModel.from_pretrained('gpt2')
train_dataset = SDAbstractsDataset('train')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=sd_data_collator
)

trainer.train()

ameasure commented 3 years ago

When the model receives inputs that include the labels, it is supposed to return a tuple of (loss, predictions), where the loss is a scalar. The trainer then calls backward() on the loss to compute the gradients. In this case (or at least in my case, when I got a similar error), the trainer appears to be using the predictions rather than the loss to compute the gradient. That seems to happen because the model is not receiving 'labels' as input and so returns only a one-element tuple of (predictions,). You should be able to fix it by passing a value for 'labels' in your collator. See, for example, transformers.DataCollatorForLanguageModeling.
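The mechanism described in this comment can be reproduced in plain PyTorch, without Transformers. The following is a minimal sketch, not the Trainer's actual code: the tensor names and shapes are made up for illustration, with `predictions` standing in for a non-scalar model output.

```python
import torch

# Stand-in for model predictions: a non-scalar tensor that requires grad.
weights = torch.ones(3, requires_grad=True)
predictions = weights * 2.0  # shape (3,), not a scalar

# backward() without an explicit gradient argument only works on scalars.
try:
    predictions.backward()
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)

# Reducing to a scalar (here: the mean) makes backward() work.
loss = predictions.mean()
loss.backward()
print(weights.grad)
```

The first call fails because autograd cannot pick an implicit starting gradient for a tensor with more than one element. The second succeeds because a scalar has an unambiguous starting gradient of 1.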

MojammelHossain commented 3 years ago

I was getting the same error because the model I chose does not return a loss even when I pass labels. It's worth checking the documentation of the model you are using to see whether its forward() returns a loss. This is a screenshot of what forward() returns for BertModel (the model I chose first); it does not return any loss value. [image] And this is a screenshot of what forward() returns for BertLMHeadModel (the model I chose later); it does return a loss value. [image]
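One quick way to run a similar check from code, assuming the second model meant above is `BertLMHeadModel`, is to look at whether a model class's forward() accepts a `labels` argument. The helper name `accepts_labels` is hypothetical, and this is only a heuristic: the model documentation remains the authoritative source for what forward() returns.

```python
import inspect

from transformers import BertLMHeadModel, BertModel


def accepts_labels(model_class):
    """Return True if model_class.forward has a parameter named 'labels'."""
    return "labels" in inspect.signature(model_class.forward).parameters


# The bare encoder has no head to compute a loss with;
# the LM-head variant takes labels and can compute one.
print("BertModel:", accepts_labels(BertModel))
print("BertLMHeadModel:", accepts_labels(BertLMHeadModel))
```

A model class whose forward() has no `labels` parameter has nothing to compute a loss from, so its first output will not be a scalar loss.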

aclifton314 commented 3 years ago

@ameasure @MojammelHossain Thank you both for your feedback! Checking the GPT2 documentation showed me an example of what I could set the labels value to in my collator.
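For readers who land here with the same error, here is a rough sketch of what setting the labels in a collator could look like for causal language modeling. The GPT2 documentation notes that labels are shifted inside the model, so they can simply be a copy of input_ids, and that positions set to -100 are ignored by the loss. The helper name `add_lm_labels` and the toy token ids are assumptions for illustration, not code from this thread.

```python
import torch


def add_lm_labels(input_ids, attention_mask):
    """Build a causal-LM batch whose labels mirror input_ids (hypothetical helper)."""
    labels = input_ids.clone()
    # Padding positions (attention_mask == 0) are set to -100 so the
    # cross-entropy loss ignores them.
    labels[attention_mask == 0] = -100
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Toy batch: the second sequence is padded (pad id 0 is illustrative only).
input_ids = torch.tensor([[11, 12, 13, 14], [21, 22, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

batch = add_lm_labels(input_ids, attention_mask)
print(batch["labels"])
```

In the collator from the question, the equivalent change would be to replace `batch['labels'] = None` with a copy of `encoded_results['input_ids']` masked in this way. Masking by attention mask rather than by pad token id matters here because the pad token is set to the EOS token.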